Abstract: Given a large graph $G = (V, E)$ with millions of nodes
and edges, how do we compute its connected components efficiently?
Recent work addresses this problem in map-reduce, where a
fundamental trade-off exists between the number of map-reduce
rounds and the communication of each round. Denoting by $d$ the
diameter of the graph, and by $n$ the number of nodes in the largest
component, all prior techniques for map-reduce either require a
linear, $\Theta(d)$, number of rounds, or a quadratic,
$\Theta(n|V| + |E|)$, communication per round.

We propose here two efficient map-reduce algorithms: (i)
Hash-Greater-to-Min, which is a randomized algorithm based on
PRAM techniques, requiring $O(\log n)$ rounds and $O(|V| + |E|)$
communication per round, and (ii) Hash-to-Min, which is a novel
algorithm, provably finishing in $O(\log n)$ iterations for path
graphs. The proof technique used for Hash-to-Min is novel, but the
resulting bound is not tight, and in practice Hash-to-Min is
actually faster than Hash-Greater-to-Min. We conjecture that it
requires $2 \log d$ rounds and $3(|V| + |E|)$ communication per
round, as observed in our experiments. Using secondary sorting, a
standard map-reduce feature, we scale Hash-to-Min to graphs with
very large connected components.

Our techniques for connected components can be applied to
clustering as well. We propose a novel algorithm for agglomerative
single linkage clustering in map-reduce. This is the first
map-reduce algorithm for clustering in at most $O(\log n)$ rounds,
where $n$ is the size of the largest cluster. We show the
effectiveness of all our algorithms through detailed experiments on
large synthetic as well as real-world datasets.



I. Introduction 

Given a large graph $G = (V, E)$ with millions of nodes and edges,
how do we compute its connected components efficiently? With the
proliferation of large databases of linked data, it has become very
important to scale to large graphs. Examples of large datasets
include the graph of webpages, where edges are hyperlinks between
documents, social networks that link entities like people, and
Linked Open Data¹, which represents a collection of linked
structured entities. The problem of finding connected components on
such graphs, and the related problem of undirected s-t connectivity
(USTCON [14]) that checks whether two nodes $s$ and $t$ are
connected, are fundamental as they are basic building blocks for
more complex graph analyses, like clustering.

The number of vertices $|V|$ and edges $|E|$ in these graphs is
very large. Moreover, such graphs often arise as intermediate
outputs of large batch-processing tasks (e.g., clustering Web pages
and entity resolution), thus requiring us to design algorithms in a
distributed setting. Map-reduce has become a very popular choice
for distributed data processing. In map-reduce, there are two
critical metrics to be optimized: the number of map-reduce rounds,
since each additional job incurs significant running time overhead
because of synchronization and congestion issues, and the
communication per round, since this determines the size of the
intermediate data.

¹ http://linkeddata.org/

There has been prior work on finding connected components
iteratively in map-reduce, and a fundamental trade-off exists
between the number of rounds and the communication per round.
Starting from small clusters, these techniques iteratively expand
existing clusters, by either adding adjacent one-hop graph
neighbors, or by merging existing overlapping clusters. The former
kind [5], [10], [11] require $\Theta(d)$ map-reduce rounds for a
graph with diameter $d$, while the latter [1] require a larger,
$\Theta(n|V| + |E|)$, computation per round, with $n$ being the
number of nodes in the largest component.

More efficient $O(\log n)$ time PRAM algorithms have been proposed
for computing connected components. While theoretical results
simulating $O(\log n)$ PRAM algorithms in map-reduce using
$O(\log n)$ rounds exist [12], the PRAM algorithms for connected
components have not yet been ported to a practical and efficient
map-reduce algorithm. (See Sec. II-A for a more detailed
description.)

In this paper, we present two new map-reduce algorithms for
computing connected components. The first algorithm, called
Hash-Greater-to-Min, is an efficient map-reduce implementation of
existing PRAM algorithms [11], [20], and provably requires at most
$3 \log n$ rounds with high probability, and a per-round
communication cost² of at most $2(|V| + |E|)$. The second
algorithm, called Hash-to-Min, is novel, and provably finishes in
$O(\log n)$ rounds for path graphs. The proof technique used for
Hash-to-Min is novel, but not tight, and our experiments show that
it requires at most $2 \log d$ rounds and $3(|V| + |E|)$
communication per round.

Both of our map-reduce algorithms iteratively merge overlapping
clusters to compute connected components. Low communication cost is
achieved by ensuring that a single cluster is replicated exactly
once, using tricks like pointer-doubling, commonly used in the PRAM
literature. The more intricate problem is processing graphs for
which connected components are so big that either (i) they do not
fit in the memory of a single machine, and hence cause failures, or
(ii) they result in heavy data skew, with some clusters being small
while others are large.

² Measured as the total number of $c$-bit messages communicated,
where $c$ is the number of bits needed to represent a single node
in the graph.

The above problems mean that we need to merge overlapping clusters,
i.e., remove duplicate nodes occurring in multiple clusters,
without materializing entire clusters in memory. Using Hash-to-Min,
we solve this problem by maintaining each cluster as key-value
pairs, where the key is a common cluster id and the values are
nodes. Moreover, the values are kept sorted (in lexicographic
order), using a map-reduce capability called secondary sorting,
which incurs no extra computation cost. Intuitively, when clusters
are merged by the algorithm, mappers individually get values (i.e.,
nodes) for a key, sort them, and send them to the reducer for that
key. Then the reducer gets the 'merge-sorted' list of all values,
corresponding to nodes from all clusters needing to be merged. In a
single scan, the reducer then removes any duplicates from the
sorted list, without materializing entire clusters in memory.
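
The single-scan duplicate removal described above can be sketched
as follows. This is a minimal Python illustration under stated
assumptions, not the implementation used in the paper:
`merge_dedup` is a hypothetical helper name, and the standard
library function `heapq.merge` stands in for the merge-sorted
stream that secondary sorting would hand to a reducer.

```python
import heapq
from typing import Iterable, Iterator


def merge_dedup(*sorted_runs: Iterable[int]) -> Iterator[int]:
    """Yield the union of several sorted runs without materializing it.

    Each run plays the role of one cluster's node list, already sorted
    by a mapper. heapq.merge lazily produces one globally sorted stream,
    so duplicates are adjacent and remembering one value is enough.
    """
    previous = None
    for node in heapq.merge(*sorted_runs):
        if node != previous:
            yield node
        previous = node


if __name__ == "__main__":
    # Two overlapping clusters that share node 4.
    print(list(merge_dedup([1, 2, 4], [3, 4, 5])))
```

Only the previously seen value is kept in memory, which is the
property the text relies on; the sorted runs themselves could be
arbitrarily long streams.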

We also present two novel map-reduce algorithms for single-linkage
agglomerative clustering using similar ideas: one using Hash-to-All
that provably completes in $O(\log n)$ map-reduce rounds with at
most $O(n|V| + |E|)$ communication per round, and another using
Hash-to-Min that we conjecture completes in $O(\log d)$ map-reduce
rounds with at most $O(|V| + |E|)$ communication per round. We
believe that these are the first map-reduce algorithms for single
linkage clustering that finish in $o(n)$ rounds.

All our algorithms can be easily adapted to the Bulk Synchronous
Parallel paradigm used by recent distributed graph processing
systems like Pregel [16] and Giraph [4]. We choose to focus on the
map-reduce setting, since it is more impacted by a reduction in the
number of iterations, thus more readily showing the gains brought
by our algorithms (see Sections II and VI-C for a more detailed
discussion).

Contributions and Outline: 

• We propose two novel algorithms for connected components:
  (i) Hash-Greater-to-Min, which provably requires at most
  $3 \log n$ rounds with high probability, and at most
  $2(|V| + |E|)$ communication per round, and (ii) Hash-to-Min,
  which we prove requires at most $4 \log n$ rounds on path graphs,
  and requires $2 \log d$ rounds and $3(|V| + |E|)$ communication
  per round in practice. (Section III)

• While Hash-Greater-to-Min requires connected components to fit in
  a single machine's memory, we propose a robust implementation of
  Hash-to-Min that scales with arbitrarily large connected
  components. We also describe extensions to Hash-to-Min for load
  balancing, and show that on large social network graphs, for
  which Hash-Greater-to-Min runs out of memory, Hash-to-Min still
  works efficiently. (Section IV)

• We also present two algorithms for single linkage agglomerative
  clustering using our framework: one using Hash-to-All that
  provably finishes in $O(\log n)$ map-reduce rounds, and at most
  $O(n|V| + |E|)$ communication per round, and the other using
  Hash-to-Min that we again conjecture finishes in $O(\log d)$
  map-reduce rounds, and at most $O(|V| + |E|)$ communication per
  round. (Section V)

• We present detailed experimental results evaluating our
  algorithms for connected components and clustering and compare
  them with previously proposed algorithms on multiple real-world
  datasets. (Section VI)

We present related work in Sec. II, followed by algorithm and
experiment sections, and then conclude in Sec. VII.

II. Related Work 

The problems of finding connected components and undirected s-t
connectivity (USTCON) are fundamental and very well studied in many
distributed settings including PRAM, MapReduce, and BSP. We discuss
each of them below.



Name                             # of steps     Communication
-------------------------------  -------------  -----------------
Pegasus [10]                     $O(d)$         $O(|V| + |E|)$
Zones [5]                        $O(d)$         $O(|V| + |E|)$
L Datalog [1]                    $O(d)$         $O(n|V| + |E|)$
NL Datalog [1]                   $O(\log d)$    $O(n|V| + |E|)$
PRAM [20], others in Sec. II-A   $O(\log n)$    shared memory
Hash-Greater-to-Min              $3 \log n$     $2(|V| + |E|)$

TABLE I
Complexity comparison with related work: $n$ = # of nodes in the
largest component, and $d$ = graph diameter

A. Parallel Random Access Machine (PRAM) 

The PRAM computation model allows several processors to compute in
parallel using a common shared memory. PRAM can be classified as
CRCW PRAM if concurrent writes to shared memory are permitted, and
CREW PRAM if not. Although map-reduce does not have a shared
memory, PRAM algorithms are still relevant, for two reasons: (i)
some PRAM algorithms can be ported to map-reduce by case-by-case
analysis, and (ii) a general theoretical result [12] shows that any
$O(t)$ CREW PRAM algorithm can be simulated in $O(t)$ map-reduce
steps.

For the CRCW PRAM model, Shiloach and Vishkin [20] proposed a
deterministic $O(\log n)$ algorithm to compute connected
components, with $n$ being the size of the largest component. Since
then, several other $O(\log n)$ CRCW algorithms have been proposed
in [8], [15], [18]. However, since they require concurrent writes,
it is not obvious how to translate them to map-reduce efficiently,
as the simulation result of [12] applies only to CREW PRAM.

For the CREW PRAM model, Johnson et al. [9] provided a
deterministic $O(\log^{3/2} n)$ time algorithm, which was
subsequently improved to $O(\log n)$ by Karger et al. [11]. These
algorithms can be simulated in map-reduce using the result of [12].
However, they require computing all nodes at a distance 2 of each
node, which would require $O(n^2)$ communication per map-reduce
iteration on a star graph.

Conceptually, our algorithms are most similar to the CRCW PRAM
algorithm of Shiloach and Vishkin [20]. That algorithm maintains a
connected component as a forest of trees, and repeatedly applies
either the operation of pointer doubling (pointing a node to its
grandparent in the tree), or of hooking a tree to another tree.
Krishnamurthy et al. [13] propose a more efficient implementation,
similar to map-reduce, by interleaving local computation on local
memory, and parallel computation on shared memory. However, pointer
doubling and hooking require concurrent writes, which are hard to
implement in map-reduce. Our Hash-to-Min algorithm performs
conceptually similar, but slightly different, operations in a
single map-reduce step.



B. Map-reduce Model 

Google's map-reduce lecture series describes an iterative approach
for computing connected components. In each iteration a series of
map-reduce steps are used to find and include all nodes adjacent to
current connected components. The number of iterations required for
this method, and many of its improvements [5], [10], [11], is
$O(d)$, where $d$ is the diameter of the largest connected
component. These techniques do not scale well for large-diameter
graphs (such as graphical models where edges represent correlations
between variables). Even for moderate-diameter graphs (with
$d = 20$), our techniques outperform the $O(d)$ techniques, as
shown in the experiments.

Afrati et al. [1] propose map-reduce algorithms for computing the
transitive closure of a graph, a relation containing tuples of
pairs of nodes that are in the same connected component. These
techniques have a larger communication per iteration, as the
transitive closure relation itself is quadratic in the size of the
largest component. Recently, Seidl et al. [19] have independently
proposed map-reduce algorithms similar to ours, including the use
of secondary sorting. However, they do not show the $O(\log n)$
bound on the number of map-reduce rounds.

Table I summarizes the related work comparison and shows that our
Hash-Greater-to-Min algorithm is the first map-reduce technique
with a logarithmic number of iterations and linear communication
per iteration.

C. Bulk Synchronous Parallel (BSP) 

In the BSP paradigm, computation is done in parallel by processors
in between a series of synchronized point-to-point communication
steps. The BSP paradigm is used by recent distributed graph
processing systems like Pregel [16] and Giraph [4]. BSP is
generally considered more efficient for graph processing than
map-reduce, as it has lower setup and overhead costs for each new
iteration. While the algorithmic improvements of reducing the
number of iterations presented in this paper are applicable to BSP
as well, these improvements are of less significance in BSP due to
the lower overhead of additional iterations.

However, we show that BSP does not necessarily dominate map-reduce
for large-scale graph processing (and thus our algorithmic
improvements for map-reduce are still relevant and important). We
show this by running an experiment in shared grids having congested
environments in Sec. VI-C.

The experiment shows that in congested clusters, map-reduce can
have a better latency than BSP, since in the latter one needs to
acquire and hold machines with a combined memory larger than the
graph size. For instance, consider a graph with a billion nodes and
ten billion edges. Suppose each node is associated with a state of
256 bytes (e.g., the contents of a web page, or recent updates by a
user in a social network, etc.). Then the total memory required
would be about 256 GB, say, 256 machines with 1 GB of RAM each. In
a congested grid, waiting for 256 machines could take much longer
than running a map-reduce job, since the map-reduce jobs can work
with a smaller number of mappers and reducers (say 50-100), and
switch in between different MR jobs in the congested environment.

III. Connected Components on Map-Reduce 

In this section, we present map-reduce algorithms for computing
connected components. All our algorithms are instantiations of a
general map-reduce framework (Algorithm 1), which is parameterized
by two functions: a hashing function $h$, and a merging function
$m$ (see line 1 of Algorithm 1). Different choices for $h$ and $m$
(listed in Table II) result in algorithms having very different
complexity.

Our algorithm framework maintains a tuple (key, value) for each
node $v$ of the graph: the key is the node identifier $v$, and the
value is a cluster of nodes, denoted $C_v$. The value $C_v$ is
initialized as either containing only the node $v$, or containing
$v$ and all its neighbors $nbrs(v)$ in $G$, depending on the
algorithm (see line 3 of Algorithm 1). The framework updates $C_v$
through multiple map-reduce iterations.

 1: Input: A graph $G = (V, E)$,
           hashing function $h$,
           merging function $m$, and
           export function EXPORT
 2: Output: A set of connected components $C \subseteq 2^V$
 3: Either initialize $C_v = \{v\}$ or $C_v = \{v\} \cup nbrs(v)$,
    depending on the algorithm.
 4: repeat
 5:   mapper for node $v$:
 6:     Compute $h(C_v)$, which is a collection of key-value pairs
        $(u, C_u)$ for $u \in C_v$.
 7:     Emit all $(u, C_u) \in h(C_v)$.
 8:   reducer for node $v$:
 9:     Let $\{C_v^{(1)}, \ldots, C_v^{(K)}\}$ denote the set of
        values received from different mappers.
10:     Set $C_v \leftarrow m(\{C_v^{(1)}, \ldots, C_v^{(K)}\})$.
11: until $C_v$ does not change for all $v$
12: Return $C = \mathrm{EXPORT}(\cup_v \{C_v\})$

Algorithm 1: General Map Reduce Algorithm

Hash-Min             emits $(v, C_v)$, and $(u, \{v_{\min}\})$ for
                     all nodes $u \in nbrs(v)$.
Hash-to-All          emits $(u, C_v)$ for all nodes $u \in C_v$.
Hash-to-Min          emits $(v_{\min}, C_v)$, and
                     $(u, \{v_{\min}\})$ for all nodes $u \in C_v$.
Hash-Greater-to-Min  computes $C_{\geq v}$, the set of nodes in
                     $C_v$ not less than $v$. It emits
                     $(v_{\min}, C_{\geq v})$, and
                     $(u, \{v_{\min}\})$ for all nodes
                     $u \in C_{\geq v}$.

TABLE II
Hashing functions: each strategy describes the key-value pairs
emitted by the mapper with input key $v$ and value $C_v$
($v_{\min}$ denotes the smallest node in $C_v$)

In the map stage of each iteration, the mapper for a key $v$
applies the hashing function $h$ on the value $C_v$ to emit a set
of key-value pairs $(u, C_u)$, one for every node $u$ appearing in
$C_v$ (see lines 6-7). The choice of hashing function governs the
behavior of the algorithm, and we will discuss different
instantiations shortly. In the reduce stage, each reducer for a key
$v$ aggregates tuples $(v, C_v^{(1)}), \ldots, (v, C_v^{(K)})$
emitted by different mappers. The reducer applies the merging
function $m$ over the values $C_v^{(i)}$ to compute a new value
$C_v$ (see lines 9-10). This process is repeated until there is no
change to any of the clusters $C_v$ (see line 11). Finally, an
appropriate EXPORT function computes the connected components $C$
from the final clusters $C_v$ using one map-reduce round.
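
The map and reduce stages described above can be mimicked on a
single machine. The following Python sketch is only an
illustration of the control flow of Algorithm 1; the names
`run_framework`, `hash_to_all`, and `union` are hypothetical, the
inbox dictionary stands in for the shuffle, and the EXPORT step is
left out.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple

Cluster = Set[int]
Clusters = Dict[int, Cluster]
# A hashing function maps (v, C_v) to the key-value pairs a mapper emits.
HashFn = Callable[[int, Cluster], Iterable[Tuple[int, Cluster]]]
# A merging function combines all values received by one reducer.
MergeFn = Callable[[List[Cluster]], Cluster]


def run_framework(init: Clusters, h: HashFn, m: MergeFn,
                  max_rounds: int = 100) -> Tuple[Clusters, int]:
    """Iterate map and reduce stages until no cluster changes.

    Returns the final clusters and the number of rounds that changed
    some cluster (the final, unchanged round is not counted).
    """
    clusters = {v: set(c) for v, c in init.items()}
    for rounds in range(max_rounds):
        inbox: Dict[int, List[Cluster]] = defaultdict(list)
        for v, c_v in clusters.items():          # map stage (lines 5-7)
            for u, value in h(v, c_v):
                inbox[u].append(set(value))
        # Reduce stage (lines 8-10): one merge per key that got messages.
        new = {v: m(values) for v, values in inbox.items()}
        if new == clusters:                       # stopping rule (line 11)
            return clusters, rounds
        clusters = new
    raise RuntimeError("no convergence within max_rounds")


def hash_to_all(v: int, c_v: Cluster) -> List[Tuple[int, Cluster]]:
    """Hash-to-All from Table II: send C_v to every node u in C_v."""
    return [(u, c_v) for u in c_v]


def union(values: List[Cluster]) -> Cluster:
    """Merging function of Hash-to-All: union of the received clusters."""
    return set().union(*values)


if __name__ == "__main__":
    # Path 1 - 2 - 3 - 4, initialized with C_v = {v} ∪ nbrs(v).
    init = {1: {1, 2}, 2: {1, 2, 3}, 3: {2, 3, 4}, 4: {3, 4}}
    final, rounds = run_framework(init, hash_to_all, union)
    print(final, rounds)
```

Swapping in a different pair of functions for `h` and `m` changes
the algorithm without touching the loop, which is the point of the
parameterization.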

Hash Functions: We describe four hashing strategies in Table II and
their complexities in Table III. The first one, denoted Hash-Min,
was used in [10]. In the mapper for key $v$, Hash-Min emits
key-value pairs $(v, C_v)$ and $(u, \{v_{\min}\})$ for all nodes
$u \in nbrs(v)$. In other words, it sends the entire cluster $C_v$
to reducer $v$ again, and sends only the minimum node $v_{\min}$ of
the cluster $C_v$ to all reducers for nodes $u \in nbrs(v)$. Thus
communication is low, but so is the rate of convergence, as
information spreads only by propagating the minimum node.

On the other hand, Hash-to-All emits key-value pairs $(u, C_v)$ for
all nodes $u \in C_v$. In other words, it sends the cluster $C_v$
to all reducers $u \in C_v$. Hence if clusters $C_v$ and $C_{v'}$
overlap on some node $u$, they will both be sent to the reducer of
$u$, where they can be merged, resulting in a faster convergence.



Algorithm              MR Rounds       Communication (per MR step)
---------------------  --------------  ---------------------------
Hash-Min [10]          $d$             $O(|V| + |E|)$
Hash-to-All            $\log d$        $O(n|V| + |E|)$
Hash-to-Min            $O(\log n)$*    $O(\log n |V| + |E|)$*
Hash-Greater-to-Min    $3 \log n$      $2(|V| + |E|)$

TABLE III
Complexity of different algorithms (* denotes results that hold
only for path graphs)

But sending the entire cluster $C_v$ to all reducers $u \in C_v$
results in a large, quadratic communication cost. To overcome this,
Hash-to-Min sends the entire cluster $C_v$ to only one reducer,
$v_{\min}$, while other reducers are just sent $\{v_{\min}\}$. This
decreases the communication cost drastically, while still achieving
fast convergence. Finally, the best theoretical complexity bounds
can be shown for Hash-Greater-to-Min, which sends out a smaller
subset $C_{\geq v}$ of $C_v$. We look at how these functions are
used in specific algorithms next.
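
To make the difference in emitted data concrete, the four hashing
strategies of Table II can be written as small Python functions.
This is a sketch under stated assumptions: the function names and
the helper `message_size` are illustrative, node ids are integers
ordered by `<`, and Hash-Min takes the adjacency map as an extra
argument because it is the only strategy that looks at $nbrs(v)$.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[int, Set[int]]


def hash_min(v: int, c_v: Set[int], nbrs: Dict[int, Set[int]]) -> List[Pair]:
    """Emit (v, C_v), and (u, {v_min}) for every neighbor u of v."""
    v_min = min(c_v)
    return [(v, set(c_v))] + [(u, {v_min}) for u in sorted(nbrs[v])]


def hash_to_all(v: int, c_v: Set[int]) -> List[Pair]:
    """Emit (u, C_v) for every node u in C_v."""
    return [(u, set(c_v)) for u in sorted(c_v)]


def hash_to_min(v: int, c_v: Set[int]) -> List[Pair]:
    """Emit (v_min, C_v), and (u, {v_min}) for every node u in C_v."""
    v_min = min(c_v)
    return [(v_min, set(c_v))] + [(u, {v_min}) for u in sorted(c_v)]


def hash_greater_to_min(v: int, c_v: Set[int]) -> List[Pair]:
    """Emit (v_min, C_>=v), and (u, {v_min}) for every node u in C_>=v."""
    v_min = min(c_v)
    c_geq = {u for u in c_v if u >= v}
    return [(v_min, c_geq)] + [(u, {v_min}) for u in sorted(c_geq)]


def message_size(pairs: List[Pair]) -> int:
    """Total number of node ids carried in the emitted values."""
    return sum(len(value) for _, value in pairs)


if __name__ == "__main__":
    cluster = {3, 4, 5, 6}
    for name, pairs in [
        ("Hash-to-All", hash_to_all(5, cluster)),
        ("Hash-to-Min", hash_to_min(5, cluster)),
        ("Hash-Greater-to-Min", hash_greater_to_min(5, cluster)),
    ]:
        print(name, message_size(pairs), pairs)
```

For a cluster of size $s$, Hash-to-All carries $s^2$ node ids while
Hash-to-Min carries $2s$, which matches the quadratic versus linear
behavior discussed above.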

A. Hash-Min Algorithm 

The Hash-Min algorithm is a version of the Pegasus algorithm [10].
In this algorithm each node $v$ is associated with a label
$v_{\min}$ (i.e., $C_v$ is a singleton set $\{v_{\min}\}$) which
corresponds to the smallest id amongst nodes that $v$ knows are in
its connected component. Initially $v_{\min} = v$ and so
$C_v = \{v\}$. It then uses the Hash-Min hashing function to
propagate its label $v_{\min}$ in $C_v$ to all reducers
$u \in nbrs(v)$ in every round. On receiving the messages, the
merging function $m$ computes the smallest node $v_{\min}^{new}$
amongst the incoming messages and sets $C_v = \{v_{\min}^{new}\}$.
Thus a node adopts the minimum label found in its neighborhood as
its own label. On convergence, nodes that have the same label are
in the same connected component. Finally, the connected components
are computed by the following EXPORT function: return sets of nodes
grouped by their label.

Theorem 3.1 (Hash-Min [10]): Algorithm Hash-Min correctly computes
the connected components of $G = (V, E)$ using $O(|V| + |E|)$
communication and $O(d)$ map-reduce rounds.
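
The label propagation performed by Hash-Min can be simulated in a
few lines. The sketch below is an illustration only: the names
`hash_min_components` and `export` are hypothetical, the graph is
an in-memory adjacency map, and each loop iteration stands for one
map-reduce round.

```python
from typing import Dict, Set, Tuple


def hash_min_components(nbrs: Dict[int, Set[int]]) -> Tuple[Dict[int, int], int]:
    """Run Hash-Min until labels stop changing.

    Returns (label per node, number of rounds in which a label changed).
    """
    label = {v: v for v in nbrs}            # initially v_min = v
    rounds = 0
    while True:
        # Reducer v sees its own label and the labels of its neighbors.
        new = {v: min([label[v]] + [label[u] for u in nbrs[v]]) for v in nbrs}
        if new == label:
            return label, rounds
        label, rounds = new, rounds + 1


def export(label: Dict[int, int]) -> Dict[int, Set[int]]:
    """EXPORT for Hash-Min: group nodes by their final label."""
    groups: Dict[int, Set[int]] = {}
    for v, lab in label.items():
        groups.setdefault(lab, set()).add(v)
    return groups


if __name__ == "__main__":
    # Two components: the path 1 - 2 - 3 - 4 - 5 and the edge 6 - 7.
    nbrs = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}, 6: {7}, 7: {6}}
    label, rounds = hash_min_components(nbrs)
    print(export(label), rounds)
```

On this input the minimum label 1 has to travel four hops to reach
node 5, so four rounds change some label, in line with the $O(d)$
bound of Theorem 3.1.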

B. Hash-to-All Algorithm 

The Hash-to-All algorithm initializes each cluster
$C_v = \{v\} \cup nbrs(v)$. Then it uses the Hash-to-All hashing
function to send the entire cluster $C_v$ to all reducers
$u \in C_v$. On receiving the messages, the merge function $m$
updates the cluster by taking the union of all the clusters
received by the node. More formally, if the reducer at $v$ receives
clusters $C_v^{(1)}, \ldots, C_v^{(K)}$, then $C_v$ is updated to
$\cup_{i=1}^{K} C_v^{(i)}$.

We can show that after $\log d$ map-reduce rounds, for every $v$,
$C_v$ contains all the nodes in the connected component containing
$v$. Hence, the EXPORT function just returns the distinct sets in
$C$ (using one map-reduce step).

Theorem 3.2 (Hash-to-All): Algorithm Hash-to-All correctly computes
the connected components of $G = (V, E)$ using $O(n|V| + |E|)$
communication per round and $\log d$ map-reduce rounds, where $n$
is the size of the largest component and $d$ the diameter of $G$.

Note on Hash-Min: [10] has additional optimizations that do not
change the asymptotic complexity. We do not describe them here.



Proof: We can show using induction that after $k$ map-reduce steps,
every node $u$ that is at a distance $\leq 2^k$ from $v$ is
contained in $C_v$. Initially this is true, since all neighbors are
part of $C_v$. For the $(k+1)$-st step, $u \in C_w$ for some $w$,
and $w \in C_v$, such that the distances between $u, w$ and between
$w, v$ are at most $2^k$. Hence, for every node $u$ at a distance
at most $2^{k+1}$ from $v$, $u \in C_v$ after $k + 1$ steps. The
proof for communication complexity follows from the fact that each
node is replicated at most $n$ times. ■
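
The doubling argument in the proof can be checked on a small path
graph. The following Python sketch is illustrative; the helper
names `hash_to_all_round` and `path_clusters` are assumptions, and
the choice of a 9-node path (diameter $d = 8$) is arbitrary.

```python
from typing import Dict, Set


def hash_to_all_round(clusters: Dict[int, Set[int]]) -> Dict[int, Set[int]]:
    """One Hash-to-All round: send C_v to every u in C_v, then union."""
    new: Dict[int, Set[int]] = {v: set() for v in clusters}
    for c_v in clusters.values():
        for u in c_v:
            new[u] |= c_v
    return new


def path_clusters(n: int) -> Dict[int, Set[int]]:
    """Initial clusters C_v = {v} ∪ nbrs(v) on the path 1 - 2 - ... - n."""
    return {v: {u for u in (v - 1, v, v + 1) if 1 <= u <= n}
            for v in range(1, n + 1)}


if __name__ == "__main__":
    n = 9                                   # path with diameter d = 8
    clusters = path_clusters(n)
    for k in range(1, 4):
        clusters = hash_to_all_round(clusters)
        # After k rounds, C_1 holds every node within distance 2^k of 1.
        print(k, sorted(clusters[1]))
```

The printed clusters for node 1 grow from radius 2 to radius 4 to
radius 8, so three rounds ($\log_2 8$) suffice on this input.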

C. Hash-to-Min Algorithm 

While the Hash-to-All algorithm computes the connected components
in a smaller number of map-reduce steps than Hash-Min, the size of
the intermediate data (and hence the communication) can become
prohibitively large even for sparse graphs having large connected
components. We now present Hash-to-Min, a variation on Hash-to-All,
that we show finishes in at most $4 \log n$ steps for path graphs.
We also show that in practice it takes at most $2 \log d$ rounds
and linear communication cost per round (see Section VI), where $d$
is the diameter of the graph.

The Hash-to-Min algorithm initializes each cluster
$C_v = \{v\} \cup nbrs(v)$. Then it uses the Hash-to-Min hash
function to send the entire cluster $C_v$ to reducer $v_{\min}$,
where $v_{\min}$ is the smallest node in the cluster $C_v$, and
$\{v_{\min}\}$ to all reducers $u \in C_v$. The merging function
$m$ works exactly like in Hash-to-All: $C_v$ is the union of all
the nodes appearing in the received messages. We explain how this
algorithm works by an example.

Example 3.3: Consider an intermediate step where clusters
$C_1 = \{1, 2, 4\}$ and $C_5 = \{3, 4, 5\}$ have been associated
with keys 1 and 5. We will show how these clusters are merged in
both the Hash-to-All and Hash-to-Min algorithms.

In the Hash-to-All scheme, the mapper at 1 sends the entire cluster
$C_1$ to reducers 1, 2, and 4, while the mapper at 5 sends $C_5$ to
reducers 3, 4, and 5. Therefore, on reducer 4, the entire cluster
$C_4 = \{1, 2, 3, 4, 5\}$ is computed by the merge function. In the
next step, this cluster $C_4$ is sent to all the five reducers.

In the Hash-to-Min scheme, the mapper at 1 sends $C_1$ to reducer
1, and $\{1\}$ to reducers 2 and 4. Similarly, the mapper at 5
sends $C_5$ to reducer 3, and $\{3\}$ to reducers 4 and 5. So
reducer 4 gets $\{1\}$ and $\{3\}$, and therefore computes the
cluster $C_4 = \{1, 3\}$ using the merge function.

Now, in the second round, the mapper at 4 has 1 as the minimum node
of the cluster $C_4 = \{1, 3\}$. Thus, it sends $\{1\}$ to reducer
3, which already has the cluster $C_3 = \{3, 4, 5\}$. Thus after
the second round, the cluster $C_3 = \{1, 3, 4, 5\}$ is formed on
reducer 3. Since 1 is the minimum for $C_3$, the mapper at 3 sends
$C_3$ to reducer 1 in the third round. Hence after the end of the
third round, reducer 1 gets the entire cluster
$\{1, 2, 3, 4, 5\}$.

Note in this example that Hash-to-Min required three map-reduce
steps; however, the intermediate data transmitted is lower, since
the entire clusters $C_1$ and $C_5$ were only sent to their minimum
element's reducer (1 and 3, resp.).
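
The three Hash-to-Min rounds of Example 3.3 can be replayed with a
short Python sketch. The function name `hash_to_min_round` is
hypothetical, and the dictionary `inbox` stands in for the shuffle
between mappers and reducers; the starting state is the
intermediate state given in the example, in which only keys 1 and 5
hold clusters.

```python
from collections import defaultdict
from typing import Dict, Set


def hash_to_min_round(clusters: Dict[int, Set[int]]) -> Dict[int, Set[int]]:
    """One Hash-to-Min round over whatever clusters currently exist."""
    inbox: Dict[int, Set[int]] = defaultdict(set)
    for c_v in clusters.values():
        v_min = min(c_v)
        inbox[v_min] |= c_v              # whole cluster goes to reducer v_min
        for u in c_v:
            inbox[u].add(v_min)          # every member only learns v_min
    return dict(inbox)


if __name__ == "__main__":
    # Intermediate state of Example 3.3: only keys 1 and 5 hold clusters.
    clusters = {1: {1, 2, 4}, 5: {3, 4, 5}}
    for r in (1, 2, 3):
        clusters = hash_to_min_round(clusters)
        print(r, {v: sorted(c) for v, c in sorted(clusters.items())})
```

Running it shows reducer 4 holding $\{1, 3\}$ after the first
round, reducer 3 holding $\{1, 3, 4, 5\}$ after the second, and
reducer 1 holding all five nodes after the third.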

As the example above shows, unlike Hash-to-All, at the end of
Hash-to-Min, not all reducers $v$ are guaranteed to contain in
$C_v$ the connected component they are part of. In fact, we can
show that the reducer at $v_{\min}$ contains all the nodes in that
component, where $v_{\min}$ is the smallest node in a connected
component. For other nodes $v$, $C_v = \{v_{\min}\}$. Hence, EXPORT
outputs only those $C_v$ such that $v$ is the smallest node in
$C_v$.

Theorem 3.4 (Hash-to-Min Correctness): At the end of algorithm
Hash-to-Min, $C_v$ satisfies the following property: If $v_{\min}$
is the smallest node of a connected component $C$, then
$C_{v_{\min}} = C$. For all other nodes $v$, $C_v = \{v_{\min}\}$.

Proof: Consider any node $v$ such that $C_v$ contains $v_{\min}$.
Then in the next step, the mapper at $v$ sends $C_v$ to $v_{\min}$,
and only $\{v_{\min}\}$ to $v$. After this iteration, $C_v$ will
always have $v_{\min}$ as the minimum node, and the mapper at $v$
will always send its cluster $C_v$ to $v_{\min}$. Now at some point
of time, all nodes $v$ in the connected component $C$ will have
$v_{\min} \in C_v$ (this follows from the fact that the minimum
will propagate at least one hop in every iteration, just like in
Theorem 3.1). Thus, every mapper for node $v$ sends its final
cluster to $v_{\min}$, and only retains $v_{\min}$. Thus at
convergence $C_{v_{\min}} = C$ and $C_v = \{v_{\min}\}$. ■
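
The final state described by Theorem 3.4, together with the EXPORT
rule, can be observed with the following Python sketch. It is a
single-machine illustration, not the paper's implementation; the
names `hash_to_min` and `export` and the `max_rounds` safety limit
are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set


def hash_to_min(nbrs: Dict[int, Set[int]],
                max_rounds: int = 100) -> Dict[int, Set[int]]:
    """Iterate Hash-to-Min from C_v = {v} ∪ nbrs(v) until a fixed point."""
    clusters = {v: {v} | nbrs[v] for v in nbrs}
    for _ in range(max_rounds):
        inbox: Dict[int, Set[int]] = defaultdict(set)
        for c_v in clusters.values():
            v_min = min(c_v)
            inbox[v_min] |= c_v          # C_v is sent to reducer v_min
            for u in c_v:
                inbox[u].add(v_min)      # {v_min} is sent to every u in C_v
        new = dict(inbox)
        if new == clusters:
            return clusters
        clusters = new
    raise RuntimeError("no convergence within max_rounds")


def export(clusters: Dict[int, Set[int]]) -> List[Set[int]]:
    """Keep only the clusters C_v in which v itself is the smallest node."""
    return [c for v, c in sorted(clusters.items()) if min(c) == v]


if __name__ == "__main__":
    # Two components: the path 1 - 2 - 3 - 4 and the edge 6 - 7.
    nbrs = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}, 6: {7}, 7: {6}}
    final = hash_to_min(nbrs)
    print(final)
    print(export(final))
```

At the fixed point only the smallest node of each component holds
the whole component, and every other node holds just that smallest
node, which is why EXPORT can filter on `min(c) == v`.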

Theorem 3.5 (Hash-to-Min Communication): Algorithm Hash-to-Min
takes $O(k \cdot (|V| + |E|))$ expected communication per round,
where $k$ is the total number of rounds. Here expectation is over
the random choices of the node ordering.

Proof: Note that the total communication in any step equals the
total size of all $C_v$ in the next round. Let $n_k$ denote the
size of this intermediate data after $k$ rounds. That is,
$n_k = \sum_v |C_v|$. We show by induction that
$n_k = O(k \cdot (|V| + |E|))$.

First, $n_0 = \sum_v |C_v^0| \leq 2(|V| + |E|)$, since each node
contains itself and all its neighbors. In each subsequent round, a
node $v$ is present in $C_u$, for all $u \in C_v$. Then $v$ is sent
to a different cluster in one of two ways:

• If $v$ is the smallest node in $C_u$, then $v$ is sent to all
  nodes in $C_u$. Due to this, $v$ gets replicated to $|C_u|$
  different clusters. However, this happens with probability
  $1/|C_u|$.

• If $v$ is not the smallest node, then $v$ is sent to the smallest
  node of $C_u$. This happens with probability $1 - 1/|C_u|$.
  Moreover, once $v$ is not the smallest for a cluster, it will
  never become the smallest node; hence it will never be replicated
  more than once.

From the above two facts, in expectation after one round, the node
$v$ is sent to $s_1 = |C_v^0|$ clusters as the smallest node and to
$m_1 = |C_v^0|$ clusters as not the smallest node. After two
rounds, the node $v$ is additionally sent to $s_2 = |C_v^0|$ and
$m_2 = |C_v^0|$ clusters, in addition to the $m_1$ clusters.
Therefore, after $k$ rounds, $n_k = O(k \cdot (|V| + |E|))$. ■

Next we show that on a path graph, Hash-to-Min finishes in
$4 \log n$ rounds. The proof is rather long, and due to space
constraints appears in Sec. A of the Appendix.

Theorem 3.6 (Hash-to-Min Rounds): Let $G = (V, E)$ be a path graph
(i.e., a tree with only nodes of degree 2 or 1). Then, Hash-to-Min
correctly computes the connected component of $G = (V, E)$ in
$4 \log n$ map-reduce rounds.

Although Theorem 3.6 holds only for path graphs, we conjecture that
Hash-to-Min finishes in $2 \log d$ rounds on all inputs, with
$O(|V| + |E|)$ communication per round. Our experiments (Sec. VI)
seem to validate this conjecture.

D. Hash-Greater-to-Min Algorithm 

Now we describe the Hash-Greater-to-Min algorithm that has the best
theoretical bounds: $3 \log n$ map-reduce rounds with high
probability and $2(|V| + |E|)$ communication complexity per round
in the worst case. In the Hash-Greater-to-Min algorithm, the
clusters $C_v$ are again initialized as $\{v\}$. Then the
Hash-Greater-to-Min algorithm runs two rounds using the Hash-Min
hash function, followed by a round using the Hash-Greater-to-Min
hash function, and keeps on repeating these three rounds until
convergence.

In a round using the Hash-Min hash function, the entire cluster
$C_v$ is sent to reducer $v$, and $v_{\min}$ to all reducers
$u \in nbrs(v)$. For the merging function $m$ on the machine for
$v$, denoted $m(v)$, the algorithm first computes the minimum node
among all incoming messages, and then adds it to the message $C(v)$
received from $m(v)$ itself. More formally, say $v_{\min}^{new}$ is
the smallest node among all the messages received by $v$; then
$C_{new}(v)$ is updated to $\{v_{\min}^{new}\} \cup C(v)$.

In a round using the Hash-Greater-to-Min hash function, the set
$C_{\geq v}$ is computed as all nodes in $C_v$ not less than $v$.
This set is sent to reducer $v_{\min}$, where $v_{\min}$ is the
smallest node in $C(v)$, and $\{v_{\min}\}$ is sent to all reducers
$u \in C_{\geq v}$. The merging function $m$ works exactly like in
Hash-to-All: $C(v)$ is the union of all the nodes appearing in the
received messages. We explain this process by the following
example.

Example 3.7: Consider a path graph with $n$ edges (1,2), (2,3),
(3,4), and so on. We will now show three rounds of
Hash-Greater-to-Min.

In the Hash-Greater-to-Min algorithm, the clusters are initialized
as $C_i = \{i\}$ for $i \in [1, n]$. In the first round, the
Hash-Min function will send $\{i\}$ to reducers $i - 1$, $i$, and
$i + 1$. So each reducer $i$ will receive messages $\{i - 1\}$,
$\{i\}$ and $\{i + 1\}$, and the aggregation function will add the
incoming minimum, $i - 1$, to the previous $C_i = \{i\}$.

Thus in the second round, the clusters are $C_1 = \{1\}$ and
$C_i = \{i - 1, i\}$ for $i \in [2, n]$. Again Hash-Min will send
the minimum node $\{i - 1\}$ of $C_i$ to reducers $i - 1$, $i$, and
$i + 1$. Again the merging function would be used. At the end of
the second step, the clusters are $C_1 = \{1\}$,
$C_2 = \{1, 2\}$, and $C_i = \{i - 2, i - 1, i\}$ for
$i \in [3, n]$.

In the third round, Hash-Greater-to-Min will be used. This is where
interesting behavior is seen. The mapper at 2 will send its
$C_{\geq}(2) = \{2\}$ to reducer 1. The mapper at 3 will send its
$C_{\geq}(3) = \{3\}$ to reducer 1. Note that $C_{\geq}(3)$ does
not include 2 even though it appears in $C_3$, as $2 < 3$. Thus we
save on sending redundant messages from mapper 3 to reducer 1, as 2
has been sent to reducer 1 from mapper 2. Similarly, the mapper at
4 sends $C_{\geq}(4) = \{4\}$ to reducer 2, and mapper 5 sends
$C_{\geq}(5) = \{5\}$ to reducer 3, etc. Thus we get the sets
$C_1 = \{1, 2, 3\}$, $C_2 = \{1, 2, 4\}$, $C_3 = \{1, 3, 5\}$,
$C_4 = \{2, 4, 6\}$, and so on.
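
The three rounds of Example 3.7 can be replayed with the Python
sketch below. It rests on a labeled assumption: the sets listed at
the end of the example contain the node $i$ itself, so the sketch
lets reducer $v$ keep $v$ in its new cluster in the
Hash-Greater-to-Min round. The function names and the choice of an
8-node path are illustrative.

```python
from collections import defaultdict
from typing import Dict, Set

Clusters = Dict[int, Set[int]]


def hash_min_round(clusters: Clusters, nbrs: Dict[int, Set[int]]) -> Clusters:
    """Hash-Min round as used inside Hash-Greater-to-Min.

    Each node keeps its cluster and adds the smallest v_min announced by
    itself or by one of its neighbors.
    """
    return {v: c_v | {min(min(clusters[u]) for u in nbrs[v] | {v})}
            for v, c_v in clusters.items()}


def hash_greater_to_min_round(clusters: Clusters) -> Clusters:
    """Hash-Greater-to-Min round.

    Assumption (read off the sets listed in Example 3.7, not a formal
    definition): reducer v also keeps v itself in its new cluster.
    """
    inbox: Clusters = defaultdict(set)
    for v, c_v in clusters.items():
        v_min = min(c_v)
        c_geq = {u for u in c_v if u >= v}
        inbox[v_min] |= c_geq            # C_>=v goes to reducer v_min
        for u in c_geq:
            inbox[u].add(v_min)          # members of C_>=v learn v_min
    return {v: inbox[v] | {v} for v in clusters}


if __name__ == "__main__":
    n = 8                                 # path 1 - 2 - ... - 8
    nbrs = {v: {u for u in (v - 1, v + 1) if 1 <= u <= n}
            for v in range(1, n + 1)}
    clusters: Clusters = {v: {v} for v in nbrs}
    clusters = hash_min_round(clusters, nbrs)            # round 1
    clusters = hash_min_round(clusters, nbrs)            # round 2
    clusters = hash_greater_to_min_round(clusters)       # round 3
    print({v: sorted(c) for v, c in clusters.items()})
```

Each mapper ships only the single node in its $C_{\geq v}$ in the
third round, rather than its full three-node cluster, which is the
saving the example points out.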

The analysis of the Hash-Greater-to-Min algorithm relies on the
following lemma.

Lemma 3.8: Let $v_{\min}$ be any node. Denote by $GT(v_{\min})$ the
set of all nodes $v$ for which $v_{\min}$ is the smallest node in
$C_v$ after the Hash-Greater-to-Min algorithm converges. Then
$GT(v_{\min})$ is precisely the set $C_{\geq}(v_{\min})$.

Note that in the above example, after 3 rounds of
Hash-Greater-to-Min, $GT(2)$ is $\{2, 4\}$ and $C_{\geq}(2)$ is
also $\{2, 4\}$.

We now analyze the performance of this algorithm. The proof is
based on techniques introduced in [8], [11], and is omitted here
due to lack of space. The proof appears in the Appendix.

Theorem 3.9 (Complexity): Algorithm Hash-Greater-to-Min correctly
computes the connected components of $G = (V, E)$ in expected
$3 \log n$ map-reduce rounds (expectation is over the random
choices of the node ordering) with $2(|V| + |E|)$ communication per
round in the worst case.

IV. Scaling the Hash-to-Min Algorithm 

Hash-to-Min and Hash-Greater-to-Min complete in fewer rounds than
Hash-Min, but as currently described, they require that every
connected component of the graph fit in the memory of a single
reducer. We now describe a more robust implementation for
Hash-to-Min, which allows handling arbitrarily large connected
components. We also describe an extension to do load balancing.
Using this implementation, we show in Section VI examples of social
network graphs that have small diameter and extremely large
connected components, for which Hash-Greater-to-Min runs out of
memory, but Hash-to-Min still works efficiently.

A. Large Connected Components 

We address the problem of larger-than-memory connected
components by using secondary sorting in map-reduce, which
allows a reducer to receive the values for each key in sorted
order. Note that this sorting is generally done in map-reduce to
keep the keys in a reducer in sorted order, and can be extended 
to sort values as well, at no extra cost, using composite keys 
and custom partitioning ifTSl . 

To use secondary sorting, we represent a connected
component as follows: if $C_v$ is the cluster at node $v$, then we
represent $C_v$ as a graph with an edge from $v$ to each of the
nodes in $C_v$. Recall that each iteration of Hash-to-Min is as
follows: for hashing, denoting by $v_{min}$ the min node in $C_v$,
the mapper at $v$ sends $C_v$ to reducer $v_{min}$, and $\{v_{min}\}$ to
all reducers $u$ in $C_v$. For merging, we take the union of all
incoming clusters at reducer $v$.

Hash-to-Min can be implemented in a single map-reduce
step. The hashing step is implemented by emitting, in the
mapper, key-value pairs with key $v_{min}$ and values equal to
each of the nodes in $C_v$, and conversely, with key equal to each
node in $C_v$ and $v_{min}$ as the value. The merging step is
implemented by collecting all the values for a key $v$ and
removing duplicates.

To remove duplicates without loading all values into
memory, we use secondary sorting to guarantee that the values
for a key are provided to the reducer in sorted order. Then
the reducer can make a single pass through the values
to remove duplicates, without ever loading all of them into
memory, since duplicates are guaranteed to occur adjacent to
each other. Furthermore, computing the minimum node $v_{min}$
is also trivial, as it is simply the first value in the sorted order.
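The single-step implementation described above can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions: the function names (`hash_to_min_map`, `dedup_sorted`, `hash_to_min_round`, `connected_components`) are hypothetical, and a global `sorted` call stands in for the shuffle with secondary sorting that a real map-reduce framework would provide.

```python
from itertools import groupby


def hash_to_min_map(v, cluster):
    """Mapper at node v: emit the key-value pairs of one Hash-to-Min round."""
    v_min = min(cluster)
    for u in cluster:
        yield (v_min, u)   # C_v goes to reducer v_min, one node per value
        yield (u, v_min)   # {v_min} goes to every reducer u in C_v


def dedup_sorted(values):
    """Reducer: one pass over sorted values; duplicates are adjacent."""
    previous = None
    for value in values:
        if value != previous:
            yield value
        previous = value


def hash_to_min_round(clusters):
    """One map-shuffle-reduce round (sorting emulates secondary sort)."""
    pairs = sorted(p for v, c in clusters.items()
                   for p in hash_to_min_map(v, c))
    return {key: list(dedup_sorted(value for _, value in group))
            for key, group in groupby(pairs, key=lambda p: p[0])}


def connected_components(nodes, edges):
    """Iterate rounds until no cluster changes; label = first sorted value."""
    clusters = {v: [v] for v in nodes}
    for u, v in edges:
        clusters[u].append(v)
        clusters[v].append(u)
    clusters = {v: sorted(set(c)) for v, c in clusters.items()}
    while True:
        new_clusters = hash_to_min_round(clusters)
        if new_clusters == clusters:
            break
        clusters = new_clusters
    return {v: c[0] for v, c in clusters.items()}
```

Because each reducer only compares a value with its predecessor, the sketch never needs the whole value list of a key at once; the list built in `hash_to_min_round` is a convenience of the in-memory emulation, not part of the technique.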

B. Load Balancing Problem 

Even though Hash-to-Min can handle arbitrarily large
graphs without failure (unlike Hash-Greater-to-Min), it can
still suffer from data skew problems if some connected
components are large, while others are small. We handle this
problem by tweaking the algorithm as follows. If a cluster $C_v$



1: Input: Weighted graph $G = (V, E, w)$, stopping criterion $Stop$.
2: Output: A clustering $\mathcal{C} \subseteq 2^V$.
3: Initialize clustering $\mathcal{C} \leftarrow \{\{v\} \mid v \in V\}$;
4: repeat
5:    Find the closest pair of clusters $C_1$, $C_2$ in $\mathcal{C}$ (as per $d$);
6:    Update $\mathcal{C} \leftarrow \mathcal{C} - \{C_1, C_2\} \cup \{C_1 \cup C_2\}$;
7: until $\mathcal{C}$ does not change or $Stop(\mathcal{C})$ is true
8: Return $\mathcal{C}$

Algorithm 2: Centralized single linkage clustering



at machine $v$ is larger than a predefined threshold, we send
all nodes $u < v$ to reducer $v_{min}$ and $\{v_{min}\}$ to all reducers
$u < v$, as done in Hash-to-Min. However, for nodes $u > v$,
we send them to reducer $v$ and $\{v\}$ to reducer $u$. This ensures
that reducer $v_{min}$ does not receive too many nodes, and some
of the nodes go to reducer $v$ instead, ensuring balanced load.
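The modified hashing rule can be illustrated with a small mapper sketch. The function name `balanced_hash_to_min_map` is hypothetical, and one detail is an assumption not stated in the text: the node $v$ itself is treated like the nodes $u < v$, i.e., it is still exchanged with $v_{min}$.

```python
def balanced_hash_to_min_map(v, cluster, threshold):
    """Mapper at node v for load-balanced Hash-to-Min (illustrative sketch)."""
    v_min = min(cluster)
    too_large = len(cluster) > threshold
    for u in cluster:
        if too_large and u > v:
            yield (v, u)       # nodes above v are kept at reducer v
            yield (u, v)       # ... and learn v instead of v_min
        else:
            # Standard Hash-to-Min behaviour; assumption: also used for u == v.
            yield (v_min, u)
            yield (u, v_min)
```

With a threshold larger than every cluster the generator emits exactly the standard Hash-to-Min pairs, which matches the observation that an infinite threshold reproduces the original algorithm.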

This modified Hash-to-Min is guaranteed to converge in
at most as many steps as the standard Hash-to-Min.
However, at convergence, the nodes in a connected
component are not all guaranteed to have the minimum node $v_{min}$
of the connected component. In fact, they can have as their
minimum a node $v$ if the cluster at $v$ was bigger than the
specified threshold. We can then run standard Hash-to-Min on
the modified graph over nodes that correspond to cluster ids,
and get the final output. Note that this increases the number
of rounds by at most 2, as after load-balanced Hash-to-Min
converges, we use the standard Hash-to-Min.

Example 4.1: If the specified threshold is 1, then the modified
algorithm converges in exactly one step, returning clusters
equal to one-hop neighbors. If the specified threshold is $\infty$,
then the modified algorithm converges to the same output
as the standard one, i.e., it returns connected components.
If the specified threshold is somewhere in between (for our
experiments we set it to 100,000 nodes), then the output
clusters are subsets of connected components, for which no
cluster is larger than the threshold.

V. Single Linkage Agglomerative Clustering 

To the best of our knowledge, no map-reduce implementation
exists for single linkage clustering that completes in
$o(n)$ map-reduce steps, where $n$ is the size of the largest
cluster. We now present two map-reduce implementations for
the same, one using Hash-to-All that completes in $O(\log n)$
rounds, and another using Hash-to-Min that we conjecture to
finish in $O(\log d)$ rounds.

For clustering, we take as input a weighted graph denoted
as $G = (V, E, w)$, where $w : E \rightarrow [0, 1]$ is a weight function
on edges. An output cluster $C$ is any set of nodes, and a
clustering $\mathcal{C}$ of the graph is any set of clusters such that each
node belongs to exactly one cluster in $\mathcal{C}$.



A. Centralized Algorithm: 

Algorithm 2 shows the typical bottom-up centralized
algorithm for single linkage clustering. Initially, each node is its
own cluster. Define the distance between two clusters $C_1$, $C_2$
to be the minimum weight of an edge between the two clusters;



1: Input: Weighted graph $G = (V, E, w)$, stopping criterion $Stop$.
2: Output: A clustering $\mathcal{C} \subseteq 2^V$.
3: Initialize $\mathcal{C} = \{\{v\} \cup nbrs(v) \mid v \in V\}$.
4: repeat
5:    Map: Use Hash-to-All or Hash-to-Min to hash clusters.
6:    Reduce: Merge incoming clusters.
7: until $\mathcal{C}$ does not change or $Stop(\mathcal{C})$ is true
8: Split clusters in $\mathcal{C}$ merged incorrectly in the final iteration.
9: Return $\mathcal{C}$

Algorithm 3: Distributed single linkage clustering



i.e.,

$$d(C_1, C_2) = \min_{e = (u, v),\; u \in C_1,\; v \in C_2} w(e)$$



In each step, the algorithm picks the two closest clusters and
merges them by taking their union. The algorithm terminates
either when the clustering does not change, or when a stopping
condition, $Stop$, is reached. Typical stopping conditions are
threshold stopping, where the clustering stops when the closest
distance between any pair of clusters is larger than a threshold,
and cluster size stopping, where the clustering stops
when the merged cluster in the most recent step is too large.
Next we describe a map-reduce algorithm that simulates 
the centralized algorithm, i.e., outputs the same clustering. If 
there are two edges in the graph having the exact same weight, 
then single linkage clustering might not be unique, making it 
impossible to prove our claim. Thus, we assume that ties have 
been broken arbitrarily, perhaps, by perturbing the weights 
slightly, and thus no two edges in the graph have the same 
weight. Note that this step is required only for simplifying 
our proofs, but not for the correctness of our algorithm. 
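As an illustration of the centralized procedure with threshold stopping, the following is a minimal Python sketch. The function name `single_linkage` and the parameter `max_distance` are hypothetical; the sketch favors clarity over efficiency by rescanning all edges in every step, and it assumes edges are given as `(u, v, weight)` triples with distinct weights.

```python
def single_linkage(nodes, weighted_edges, max_distance):
    """Merge the closest pair of clusters until the closest remaining
    distance exceeds max_distance (threshold stopping)."""
    cluster_of = {v: frozenset([v]) for v in nodes}
    while True:
        # The closest pair of clusters is joined by the lightest edge
        # whose endpoints lie in two different clusters.
        crossing = [(w, u, v) for u, v, w in weighted_edges
                    if cluster_of[u] != cluster_of[v]]
        if not crossing:
            break                       # the clustering cannot change
        w, u, v = min(crossing)
        if w > max_distance:
            break                       # the stopping condition holds
        merged = cluster_of[u] | cluster_of[v]
        for node in merged:
            cluster_of[node] = merged
    return set(cluster_of.values())
```

For example, on the path 1-2-3-4 with weights 0.1, 0.5 and 0.2, a threshold of 0.3 leaves the clusters {1, 2} and {3, 4}, while a threshold of 1.0 merges everything into one cluster.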

B. Map-Reduce Algorithm 

Our map-reduce algorithm is shown in Algorithm 3.
Intuitively, we can compute single-linkage clustering by first
computing the connected components (since no cluster can
lie across multiple connected components), and then splitting
the connected components into appropriate clusters. Thus,
Algorithm 3 has the same map-reduce steps as Algorithm 1
and is implemented either using Hash-to-All or Hash-to-Min.

However, in general, clusters, as defined by the stopping
criterion $Stop$, may be much smaller than the connected
components. In the extreme case, the graph might be just one
giant connected component, but the clusters are often small
enough that they individually fit in the memory of a single
machine. Thus we need a way to check and stop execution as
soon as clusters have been computed. We do this by evaluating
$Stop(\mathcal{C})$ after each iteration of map-reduce. If $Stop(\mathcal{C})$ is
false, a new iteration of map-reduce clustering is started. If
$Stop(\mathcal{C})$ is true, then we stop iterating.

While the central algorithm can implement any stopping 
condition, checking an arbitrary predicate in a distributed 
setting can be difficult. Furthermore, while the central al- 
gorithm merges one cluster at a time, and then evaluates 
the stopping condition, the distributed algorithm evaluates 
stopping condition only at the end of a map-reduce iteration. 
This means that some reducers can merge clusters incorrectly 



in the last map-reduce iteration. We describe next how to 
stop the map-reduce clustering algorithm, and split incorrectly 
merged clusters. 

C. Stopping and Splitting Clusters 

It is difficult to evaluate an arbitrary stopping predicate in a 
distributed fashion using map-reduce. We restrict our attention 
to a restricted yet frequently used class of local monotonic 
stopping criteria, which is defined below. 

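Definition 5.1 (Monotonic Criterion): $Stop$ is monotone if
for all clusterings $\mathcal{C}$, $\mathcal{C}'$, if $\mathcal{C}'$ refines $\mathcal{C}$ (i.e., $\forall C' \in \mathcal{C}'$
$\exists C \in \mathcal{C}$, $C' \subseteq C$), then $Stop(\mathcal{C}) = 1 \Rightarrow Stop(\mathcal{C}') = 1$.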

Thus monotonicity implies that the stopping predicate continues
to remain true if some clusters are made smaller. Virtually
every stopping criterion used in practice is monotonic. Next
we define the assumption of locality, which states that the stopping
criterion can be evaluated locally on each cluster individually.

Definition 5.2 (Local Criterion): $Stop$ is local if there exists
a function $Stop_{local} : 2^V \rightarrow \{0, 1\}$ such that $Stop(\mathcal{C}) = 1$
iff $Stop_{local}(C) = 1$ for all $C \in \mathcal{C}$.

Examples of local stopping criteria include distance-
threshold (stop merging clusters if their distance becomes too
large) and size-threshold (stop merging if the size of a cluster
becomes too large). An example of a non-local stopping criterion
is to stop when the total number of clusters becomes too high.

If the stopping condition is local and monotonic, then we
can compute it efficiently in a single map-reduce step. To
explain how, we first define some notation. Given a cluster
$C \subseteq V$, denote by $G_C$ the subgraph of $G$ induced over the nodes of $C$.
Since $C$ is a cluster, we know $G_C$ is connected. We denote by
$tree(C)$ the¹ minimum weight spanning tree of $G_C$, and by
$split(C)$ the pair of clusters $C_l$, $C_r$ obtained by removing
the edge with the maximum weight in $tree(C)$. Intuitively,
$C_l$ and $C_r$ are the clusters that get merged to get $C$ in the
centralized single linkage clustering algorithm. Finally, denote by
$nbrs(C)$ the set of clusters closest to $C$ by the distance metric
$d$, i.e., if $C_1 \in nbrs(C)$, then for every other cluster $C_2$,
$d(C, C_2) > d(C, C_1)$.
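The operations $tree(C)$ and $split(C)$ can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: the names `spanning_tree` and `split` are hypothetical, Kruskal's algorithm is one possible way to build the minimum weight spanning tree, and the cluster is assumed to be connected, to have at least two nodes, and to have distinct edge weights given as `(u, v, weight)` triples.

```python
def spanning_tree(cluster, weighted_edges):
    """Kruskal's algorithm on the subgraph induced by `cluster`.
    Returns tree edges as (weight, u, v), in increasing weight order."""
    parent = {v: v for v in cluster}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    induced = [(w, u, v) for u, v, w in weighted_edges
               if u in parent and v in parent]
    tree = []
    for w, u, v in sorted(induced):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree.append((w, u, v))
    return tree


def split(cluster, weighted_edges):
    """Remove the heaviest tree edge and return the two resulting sides."""
    kept = spanning_tree(cluster, weighted_edges)[:-1]
    adjacency = {v: set() for v in cluster}
    for _, u, v in kept:
        adjacency[u].add(v)
        adjacency[v].add(u)
    start = min(cluster)
    left, stack = {start}, [start]
    while stack:                            # collect the side containing start
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in left:
                left.add(neighbor)
                stack.append(neighbor)
    return frozenset(left), frozenset(cluster) - left
```

Since the tree edges come back sorted by weight, dropping the last one removes the maximum weight edge, and a single traversal from an arbitrary node recovers one of the two sides.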

We also define the notion of core and minimal core decom- 
position as follows. 

Definition 5.3 (Core): A singleton cluster is always a core.
Furthermore, any cluster $C \subseteq V$ is a core if its split $C_l$, $C_r$
are both cores and closest to each other, i.e., $C_l \in nbrs(C_r)$
and $C_r \in nbrs(C_l)$.

Definition 5.4 (Minimal core decomposition): Given a
cluster $C$, its minimal core decomposition, $MCD(C)$, is a set
of cores $\{C_1, C_2, \ldots, C_l\}$ such that $\cup_i C_i = C$ and for every
core $C' \subseteq C$ there exists a core $C_i$ in the decomposition for
which $C' \subseteq C_i$.

Intuitively, a cluster $C$ is a core if it is valid, i.e., it is
a subset of some cluster $C'$ in the output of the centralized
single linkage clustering algorithm, and $MCD(C)$ finds the
largest cores in $C$, i.e., cores that cannot be merged with any
other node in $C$ and still be cores.

Computing MCD: We give in Algorithm 4 a method to
find the minimal core decomposition of a cluster. It checks
whether the input cluster is a core. Otherwise it computes the
cluster splits $C_l$ and $C_r$ and computes their MCD recursively.

¹The tree is unique because the edge weights are unique.



Input: Cluster $C \subseteq V$.
Output: A set of cores $\{C_1, C_2, \ldots, C_l\}$ corresponding to $MCD(C)$.
If $C$ is a core, return $\{C\}$.
Construct the spanning tree $T_C$ of $C$, and compute $C_l$, $C_r$ to be the cluster split of $C$.
Recursively compute $MCD(C_l)$ and $MCD(C_r)$.
Return $MCD(C_l) \cup MCD(C_r)$

Algorithm 4: Minimal Core Decomposition MCD



1: Input: Stopping predicate $Stop$, clustering $\mathcal{C}$.
2: Output: $Stop(\mathcal{C})$
3: For each cluster $C \in \mathcal{C}$, compute $MCD(C)$ (performed in the reduce of the calling Algorithm 3).
4: Map: Run Hash-to-All on cores, i.e., hash each core $C_i \in MCD(C)$ to all machines $m(u)$ for $u \in C_i$.
5: Reducer for node $v$: Of all incoming cores, pick the largest core, say $C_v$, and compute $Stop_{local}(C_v)$.
6: Return $\wedge_{v \in V} Stop_{local}(C_v)$

Algorithm 5: Stopping Algorithm

Note that this algorithm is centralized and takes as input 
a single cluster, which we assume fits in the memory of a 
single machine (unlike connected components, graph clusters 
are rather small). 

Stopping Algorithm: Our stopping algorithm, shown in Algorithm 5,
is run after each map-reduce iteration of Algorithm 3.
It takes as input the clustering $\mathcal{C}$ obtained after the map-
reduce iteration of Algorithm 3. It starts by computing the
minimal core decomposition, $MCD(C)$, of each cluster $C$ in
$\mathcal{C}$. This computation can be performed during the reduce step
of the previous map-reduce iteration of Algorithm 3. Then,
it runs a map-reduce iteration of its own. In the map step,
using Hash-to-All, each core $C_i$ is hashed to all machines
$m(u)$ for $u \in C_i$. In the reducer for machine $m(v)$, we pick
the incoming core with the largest size, say $C_v$. Since $Stop$ is
local, there exists a local function $Stop_{local}$. We compute
$Stop_{local}(C_v)$ to determine whether to stop processing this
core further. Finally, the algorithm stops if all the cores for
nodes $v$ in the graph are stopped.

Splitting Clusters: If the stopping algorithm (Algorithm 5)
returns true, then clustering is complete. However, some
clusters could have merged incorrectly in the final map-reduce
iteration done before the stopping condition was checked. Our
recursive splitting algorithm, Algorithm 6, correctly splits such
a cluster $C$ by first computing the minimal core decomposition,
$MCD(C)$. Then it checks for each core $C_i \in MCD(C)$
that its cluster splits $C_l$ and $C_r$ could have been merged, by
ensuring that both $Stop_{local}(C_l)$ and $Stop_{local}(C_r)$ are false.
If that is the case, then core $C_i$ is valid and added to the
output; otherwise the clusters $C_l$ and $C_r$ should not have been
merged, and $C_i$ is split further.

D. Correctness & Complexity Results 

We first show the correctness of Algorithm 3. For that we
first show the following lemma about the validity of cores.

Lemma 5.5 (Cores are valid): Let $\mathcal{C}_{central}$ be the output of
Algorithm 2, and $C$ be any core (defined according to Def. 5.3)



1: Input: Incorrectly merged cluster $C$ w.r.t. $Stop_{local}$.
2: Output: Set $S$ of correctly split clusters in $C$.
3: Initialize $S = \{\}$.
4: for $C_i$ in $MCD(C)$ do
5:    Let $C_l$ and $C_r$ be the cluster splits of $C_i$.
6:    if $Stop_{local}(C_l)$ and $Stop_{local}(C_r)$ are false then
7:       $S \leftarrow S \cup \{C_i\}$.
8:    else
9:       $S \leftarrow S \cup Split(C_i)$
10:   end if
11: end for
12: Return $S$.

Algorithm 6: Recursive Splitting Algorithm Split



such that its cluster splits $C_l$, $C_r$ have both $Stop_{local}(C_l)$ and
$Stop_{local}(C_r)$ as false. Then $C$ is valid, i.e., Algorithm 2 does
compute $C$ some time during its execution, and there exists a
cluster $C_{central}$ in $\mathcal{C}_{central}$ such that $C \subseteq C_{central}$.

Proof: The proof uses induction. For the base case,
note that any singleton core is obviously valid. Now assume
that $C$ has cluster splits $C_l$ and $C_r$, which, by the induction
hypothesis, are valid. Then we show that $C$ is also valid. Since
$Stop_{local}(C_l)$ and $Stop_{local}(C_r)$ are false for the cluster splits
of $C$, they do get merged with some clusters in Algorithm 2.
Furthermore, by definition of a core, $C_l$, $C_r$ are closest to each
other, hence they actually get merged with each other. Thus
$C = C_l \cup C_r$ is constructed at some point during the execution of
Algorithm 2, and there exists a cluster $C_{central}$ in its output
that contains $C_l \cup C_r = C$, completing the proof. ■

Next we show the correctness of Algorithm 3. Due to lack
of space, the proof is omitted and appears in the Appendix.

Theorem 5.6 (Correctness): The distributed Algorithm 3
simulates the centralized Algorithm 2, i.e., it outputs the same
clustering as Algorithm 2.

Next we state the complexity result for single linkage 
clustering. We omit the proof as it is very similar to that of 
the complexity result for connected components. 

Theorem 5.7 (Single-linkage Runtime): If Hash-to-All is
used in Algorithm 3, then it finishes in $O(\log n)$ map-reduce
iterations and $O(n|V| + |E|)$ communication per iteration,
where $n$ denotes the size of the largest cluster.

We also conjecture that if Hash-to-Min is used in
Algorithm 3, then it finishes in $O(\log d)$ steps.

VI. Experiments 

In this section, we experimentally analyze the performance 
of the proposed algorithms for computing connected compo- 
nents of a graph. We also evaluate the performance of our 
agglomerative clustering algorithms. 

Datasets: To illustrate the properties of our algorithms we use
both synthetic and real datasets.
• Movie: The Movie dataset has movie listings collected
from Y! Movie (http://movies.yahoo.com/movie/*/info) and
DBpedia (http://dbpedia.org/). Edges between two
listings correspond to movies that are duplicates of one
another; these edges are output by a pairwise matcher
algorithm. Listings may be duplicates from the same or
different sources, hence the sizes of connected components
vary. The number of nodes is $|V| = 431{,}221$ (nearly 430K)
and the number of edges is $|E| = 889{,}205$ (nearly 890K).
We also have a sample of this graph (238K nodes and 459K
edges) with weighted edges, denoted MovieW, which we
use for agglomerative clustering experiments.
• Biz: The Biz dataset has business listings coming from
two overlapping feeds that are licensed by a large internet
company. Again, edges between two businesses correspond
to business listings that are duplicates of one another. Here,
$|V| = 10{,}802{,}777$ (nearly 10.8M) and $|E| = 10{,}197{,}043$
(nearly 10.2M). We also have a version of this graph
with weighted edges, denoted BizW, which we use for
agglomerative clustering experiments.
• Social: The Social dataset has social network edges
between users of a large internet company. Social has
$|V| = 58{,}552{,}777$ (nearly 58M) and $|E| = 156{,}355{,}406$
(nearly 156M). Since social network graphs have low
diameter, we remove a random sample of edges, and
generate SocialSparse. With $|E| = 15{,}638{,}853$ (nearly
15M), the SocialSparse graph is sparser, but has a much
higher diameter than Social.

• Twitter: The Twitter dataset (collected by Cha et al.)
has follower relationships between Twitter users. Twitter has
$|V| = 42{,}069{,}704$ (nearly 42M) and $|E| = 1{,}423{,}194{,}279$
(nearly 1423M). Again we remove a random sample of
edges, and generate a sparser graph, TwitterSparse,
with $|E| = 142{,}308{,}452$ (nearly 142M).
• Synth: We also synthetically generate graphs of varying
diameters and sizes in order to better understand the
properties of the algorithms.

A. Connected Components 

Algorithms: We compare Hash-Min, Hash-Greater-to-Min,
Hash-to-All, Hash-to-Min and its load-balanced version
Hash-to-Min* (Section IV). For Hash-Min, we use the open-
source Pegasus implementation (http://www.cs.cmu.edu/~pegasus/),
which has several optimizations over the Hash-Min algorithm.
We implemented all other algorithms in Pig (http://pig.apache.org/)
on Hadoop (http://hadoop.apache.org). There is no native support
for iterative computation on Pig or Hadoop map-reduce. We
implement one iteration of our algorithm in Pig and drive a
loop using a Python script. Implementing the algorithms on
iterative map-reduce platforms like HaLoop [2] and Twister
is an interesting avenue for future work.

1) Analyzing Hash-to-Min on Synthetic Data: We start by
experimentally analyzing the rounds complexity and space
requirements of Hash-to-Min. We run it on two kinds of
synthetic graphs: paths and complete binary trees. We use
synthetic data for this experiment so that we have explicit
control over the parameters $d$, $|V|$, and $|E|$. Later we report the
performance of Hash-to-Min on real data as well. We use path
graphs since they have the largest $d$ for a given $|V|$, and complete
binary trees since they give a very small $d$, logarithmic in $|V|$.

Fig. 3. Comparison of Pegasus and our algorithms on real datasets.

For measuring the space requirement, we measure the largest
intermediate data size in any iteration of Hash-to-Min. Since
the performance of Hash-to-Min depends on the random
ordering of the nodes chosen, we choose 10 different random
orderings for each input. For the number of iterations, we report
the worst case among runs on all random orderings, while for
the intermediate data size we average the maximum intermediate
data size over all runs of Hash-to-Min. This is to verify our
conjecture that the number of iterations is $2 \log d$ in the worst
case (independent of node ordering) and that the intermediate space
complexity is $O(|V| + |E|)$ in expectation (over possible node
orderings).
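A small in-memory harness along the lines of this methodology might look as follows. It is a sketch under stated assumptions rather than the experimental code used here: the function names are hypothetical, clusters are kept as Python sets instead of key-value pairs, and a round is counted only when it changes some cluster, which may differ by a constant from how map-reduce jobs are counted in the experiments.

```python
import random


def hash_to_min_rounds(adjacency):
    """Run Hash-to-Min on sets; return (rounds that changed state, clusters)."""
    clusters = {v: {v} | nbrs for v, nbrs in adjacency.items()}
    rounds = 0
    while True:
        incoming = {v: set() for v in clusters}
        for v, cluster in clusters.items():
            v_min = min(cluster)
            incoming[v_min] |= cluster        # C_v is sent to reducer v_min
            for u in cluster:
                incoming[u].add(v_min)        # {v_min} is sent to each u in C_v
        if incoming == clusters:
            return rounds, clusters
        clusters, rounds = incoming, rounds + 1


def random_path(n, rng):
    """Path graph on n nodes whose ids are assigned in random order."""
    ids = list(range(n))
    rng.shuffle(ids)
    adjacency = {v: set() for v in ids}
    for a, b in zip(ids, ids[1:]):
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def worst_case_rounds(n, trials=10, seed=0):
    """Worst number of rounds over `trials` random node orderings."""
    rng = random.Random(seed)
    return max(hash_to_min_rounds(random_path(n, rng))[0]
               for _ in range(trials))
```

Calling `worst_case_rounds(n)` for increasing powers of two and comparing the result with `2 * math.log2(n)` gives a quick, small-scale way to eyeball the conjectured bound before running the full map-reduce experiment.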

For path graphs, we vary the number of nodes from 32 ($2^5$)
to 524,288 ($2^{19}$). In Figure 1(a), we plot the number of
iterations (worst case over 10 runs on random orderings) with
respect to $\log d$. Since the diameter of a path graph is equal
to the number of nodes, $d$ varies from 32 to 524,288 as well. As
conjectured, the plot is linear and always lies below the line
corresponding to $2 \log d$. In Figure 1(b), we plot the largest
intermediate data size (averaged over 10 runs on random
orderings) with respect to $|V| + |E|$. Note that both the x-axis
and the y-axis are in log-scale. Again as conjectured, the plot is
linear and always lies below $3(|V| + |E|)$.

For complete binary trees, we again vary the number of
nodes from 32 ($2^5$) to 524,288 ($2^{19}$). The main difference
from the path case is that for a complete binary tree, the diameter
is $2 \log(|V|)$, and hence the diameter varies only from 10 to 38.
Again in Figure 2(a), we see that the rounds complexity still
lies below the curve for $2 \log d$, supporting our conjecture even
for trees. In Figure 2(b), we again see that the space complexity
grows linearly and is bounded by $3(|V| + |E|)$.
TABLE IV
Comparison of Pegasus, Hash-to-Min, Hash-Greater-to-Min, and Hash-to-All on the Group I datasets. Time is averaged over 4 runs and rounded to minutes. Optimal times appear in bold: in all cases either Hash-to-Min or Hash-to-All is optimal.

TABLE V
Comparison of Pegasus and the Hash-to-Min* algorithm on Group II datasets. Time is averaged over 4 runs and rounded to minutes. Optimal times appear in bold: in all cases, Hash-to-Min* is optimal.
Fig. 4. Comparison of Hash-to-All and Hash-to-Min for single linkage clustering on BizW and MovieW: (a) runtimes (in minutes); (b) number of map-reduce jobs.



2) Analysis on Real Data: We next compared the Hash-to-
Min, Hash-Greater-to-Min, and Hash-to-All algorithms on real
datasets against Pegasus [10]. To the best of our knowledge,
Pegasus is the fastest technique on MapReduce for computing
connected components. Although all datasets are sparse (have
average degree less than 3), each dataset has a very different
distribution of the size $n$ of the largest connected components
and graph diameter $d$. We partition our datasets into two
groups: group I with $d \geq 20$ and relatively small $n$, and
group II with $d < 20$ and very large $n$.

Group I: Graphs with large d and small n: This group
includes the Biz, Movie, and SocialSparse datasets, which have
large diameters ranging from 20 to 80. On account of their large
diameters, these graphs require more MR jobs and hence
longer time, even though they are somewhat smaller than the
graphs in the other group. These graphs have small connected
components that fit in memory.

Each connected component in the Biz dataset represents
the duplicates of a real-world entity. Since there
are only two feeds creating this dataset, and each of the two
feeds is almost void of duplicates, the size of most connected
components is 2. In some extreme cases, there are more
duplicates, and the largest connected component we saw had
size 93. The Movie dataset has more sources,
and consequently significantly more duplicates.
Hence the size of some of its connected components is
significantly larger, with the largest containing 17,213 nodes.
Finally, the SocialSparse dataset has the largest connected
component in this group, with 2,945,644
nodes. Table IV summarizes the input graph parameters. It
also includes the number of map-reduce jobs and the total
runtime for all four techniques.

Differences in the connected component sizes have a very
interesting effect on the run-times of the algorithms, as shown
in Figures 3(a) and 3(b). On account of the extremely small
size of connected components, the runtime for all algorithms is
fastest for the Biz dataset, even though the number of nodes
and edges in Biz is larger than in the Movie dataset. Hash-to-All
has the best performance for this dataset, almost 3 times faster
than Pegasus. This is to be expected, as Hash-to-All takes just
4 iterations (in general it takes $\log d$ iterations) to converge.
Since connected components are small, the replication of
components and the large intermediate data size do not
affect its performance that much. We believe that Hash-to-All
is the fastest algorithm whenever the intermediate data size is
not a bottleneck. Hash-to-Min takes twice as many iterations
($2 \log d$ in general) and hence takes almost twice the time.
Finally, Pegasus takes even more time because of a larger
TABLE VI
Median and maximum completion times (in min:sec) for k connected component jobs deployed simultaneously using map-reduce (MR-k) and Giraph (BSP-k).

k     MR-k max   MR-k median   BSP-k max   BSP-k median
1     10:45      10:45         7:40        7:40
5     17:03      11:30         11:20       8:05
10    19:10      12:46         23:39       15:49
15    28:17      26:07         64:51       43:37


number of iterations, and a larger number of map-reduce jobs.
For the Movie and SocialSparse datasets, connected
components are much larger. Hence Hash-to-All does not finish
on these datasets due to large intermediate data sizes. However,
Hash-to-Min beats Pegasus by a factor of nearly 3 on the
SocialSparse dataset since it requires fewer
iterations. On Movie, the difference is the most stark: Hash-
to-Min is 15 times faster than Pegasus, again due to a
significant difference in the number of iterations.

Group II: Graphs with small d and large n: This group
includes the Social, TwitterSparse, and Twitter datasets, which have
a small diameter of less than 20; results are shown in
Figures 3(c) and 3(d) and Table V. Unlike Group I, these
datasets have very large connected components, such that even
a single connected component does not fit into the memory of a
single mapper. Thus we apply our robust implementation of
Hash-to-Min (denoted Hash-to-Min*) described in Sec. IV.

The Hash-to-Min* algorithm is nearly twice as fast as
Pegasus, owing to the reduction in the number of MR rounds.
The only exception is the Twitter graph, where the reduction in time
is only 18%. This is because the Twitter graph has some nodes
with very high degree, which makes load-balancing a problem
for all algorithms.

B. Single Linkage Clustering 

We implemented single linkage clustering on map-reduce
using both the Hash-to-All and Hash-to-Min hashing strategies.
We used these algorithms to cluster the MovieW and BizW
datasets. Figures 4(a) and 4(b) show the runtime and number
of map-reduce iterations for both these algorithms,
respectively. Analogous to our results for connected components, for
the MovieW dataset, we find that Hash-to-Min outperforms
Hash-to-All both in terms of total time and number
of rounds. On the BizW dataset, we find that both Hash-to-
Min and Hash-to-All take exactly the same number of rounds.
Nevertheless, Hash-to-All takes less time to complete than
Hash-to-Min. This is because some clusters (with small $n$)
finish much earlier in Hash-to-All; finished clusters reduce the
amount of communication required in further iterations.

C. Comparison to Bulk Synchronous Parallel Algorithms 

The bulk synchronous parallel (BSP) paradigm is generally
considered more efficient for graph processing than map-reduce,
as it has lower setup and overhead costs for each new iteration.
While the algorithmic improvements of reducing the number of
iterations presented in this paper are important independent
of the underlying system used, these improvements are of
less significance in BSP due to the low overhead of additional
iterations.



In this section, we show that BSP does not necessarily
dominate map-reduce for large-scale graph processing (and
thus our algorithmic improvements for map-reduce are still
relevant and important). We show this by running an
experiment in shared grids with congested environments.
We took the Movie graph and computed connected
components using Hash-to-Min (map-reduce, with 50 reducers) and
using Hash-Min on Giraph (BSP, with 100 mappers),
an open source implementation of Pregel [16] for Hadoop.
We deployed k = 1, 5, 10, and 15 copies of each algorithm
(denoted by MR-k and BSP-k), and tracked the maximum and
median completion times of the jobs. The jobs were deployed
on a shared Hadoop cluster with 454 map slots and 151 reduce
slots, and the cluster experienced normal and equal load from
other unrelated tasks.

Table VI summarizes our results. As expected, BSP-1
outperforms MR-1: unlike map-reduce, the BSP paradigm does
not have the per-iteration overheads. However, as k increases
from 1 to 15, we can see that the maximum and median
completion times for jobs increase at a faster rate for BSP-k
than for MR-k. This is because the BSP implementation needs
to hold all 100 mappers until the job completes. On the other
hand, the map-reduce implementation can naturally parallelize
the map and reduce rounds of different jobs, thus eliminating
the impact of per-round overheads. So it is not surprising that
while all jobs in MR-15 completed in about 20 minutes, it
took an hour for jobs in BSP-15 to complete. Note that the
cluster configurations favor BSP implementations since the
reducer capacity (which limits the map-reduce implementation
of Hash-to-Min) is much smaller (< 34%) than the mapper
capacity (which limits the BSP implementation of Hash-Min).
We also ran the experiments on clusters with higher ratios
of reducers to mappers, and we observed similar results (not
included due to space constraints), showing that map-reduce
handles congestion better than BSP implementations.

VII. Conclusions and Future Work

In this paper we considered the problem of finding connected
components in a large graph. We proposed the first map-
reduce algorithms that can find the connected components
in a logarithmic number of iterations: (i) Hash-Greater-to-
Min, which provably requires at most $3 \log n$ iterations with
high probability, and at most $2(|V| + |E|)$ communication per
iteration, and (ii) Hash-to-Min, which has a worse theoretical
complexity, but in practice completes in at most $2 \log d$
iterations and $3(|V| + |E|)$ communication per iteration; $n$
is the size of the largest component and $d$ is the diameter
of the graph. We showed how to extend our techniques to
the problem of single linkage clustering, and proposed the
first algorithm that computes a clustering in provably $O(\log n)$
iterations.

A. Proof of Theorem 3.6

We first restate Theorem 3.6 below.

Theorem A.1 (3.6): Let $G = (V, E)$ be a path graph (i.e., a tree with only nodes of degree 2 or 1). Then, Hash-to-Min
correctly computes the connected component of $G = (V, E)$ in $4 \log n$ map-reduce rounds.

To prove the above theorem, we first consider path graphs when node ids increase from left to right. Then we show the 
result for path graphs with arbitrary ordering. 

Lemma A.2: Consider a path with node ids increasing from left to right. Then after $k$ iterations of the Hash-to-Min
algorithm,

• For every node $j$ within a distance of $2^k$ from the minimum node $m$, $m$ knows $j$ and $j$ knows $m$.

• For every pair of nodes $i, j$ that are a distance $2^k$ apart, $i$ knows $j$ and $j$ knows $i$.

• Node $j$ is not known to and does not know any node $i$ that is at a distance $> 2^k$.
Proof: The proof is by induction.

Base Case: After 1 iteration, each node knows about its 1-hop and 2-hop neighbors (on either side).

Induction Hypothesis: Suppose the claim holds after $k-1$ iterations.

Induction Step:

In the $k$-th iteration, consider a node $j$ that is at a distance $d$ from the min node $m$, where $2^{k-1} < d \le 2^k$. From the induction hypothesis, there is some node $i$ that is $2^{k-1}$ away from $j$ that knows $j$. Since $m$ is known to $i$ (from the induction hypothesis), the Hash-to-Min algorithm would send $j$ to $m$ and $m$ to $j$ in the current iteration. Therefore, $m$ knows $j$ and $m$ is known to $j$.

Consider a node $j$ that is at distance $> 2^k$ from the min node. At the end of the previous iteration, $j$ knew (and was known
to) $i$, and $i$ knew and was known to $i'$, where $i$ and $i'$ are at distance $2^{k-1}$ from $j$ and $i$ respectively. Moreover, $i$ did not
know any node $i''$ smaller than $i'$. Therefore, in the current iteration, $i$ sends $i'$ to $j$ and $j$ to $i'$. Therefore, $j$ knows and is
known to a node that is $2^k$ distance away.

Finally, we can show that a node does not know (and is not known to) any node that is at distance $> 2^k$ as follows. Node $j$
can only get a smaller node $i'$ if $i'$ is a minimum at some node $i$. Since in the previous step no one knows a node at distance
$> 2^{k-1}$, $j$ cannot know a node at distance $> 2^k$. ■
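
The doubling behaviour claimed in Lemma A.2 can be observed on a small instance. The sketch below is illustrative only: it assumes the Hash-to-Min update rule in which every node sends its cluster to the cluster's minimum member and sends that minimum to every member, and it uses a hypothetical path 0 - 1 - ... - 39 whose ids increase from left to right. It records the farthest node known to the minimum node after each round.

```python
def hash_to_min_step(clusters):
    """One round: node v sends C_v to min(C_v), and min(C_v) to every member of C_v."""
    received = {v: set() for v in clusters}
    for cluster in clusters.values():
        m = min(cluster)
        received[m] |= cluster          # whole cluster goes to the minimum
        for u in cluster:
            received[u].add(m)          # the minimum goes to every member
    return received


def sorted_path_clusters(n):
    """Initial clusters C_v = {v} plus neighbours on the path 0 - 1 - ... - (n-1)."""
    return {v: {u for u in (v - 1, v, v + 1) if 0 <= u < n} for v in range(n)}


n = 40                                   # hypothetical path length
clusters = sorted_path_clusters(n)
reach = []                               # reach[k-1]: farthest node known to node 0 after k rounds
for k in range(1, 7):
    clusters = hash_to_min_step(clusters)
    reach.append(max(clusters[0]))
print(reach)
```

Under these assumptions the recorded distance should double each round until the end of the path is reached, which is the first and third bullet of the lemma specialised to the minimum node.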

Now we extend the proof to arbitrary path graphs. Denote by $\min_k(u)$ the minimum node after $k$ iterations that $u$ knows and that
also knows $u$. Also denote by $\Delta(u,v)$ the distance between nodes $u$ and $v$.

Definition A.3 (Local Minima): A node $v$ is a local minimum if all its neighbors have ids larger than $v$'s id.

For a path graph, we define the notion of levels below. 

Definition A.4 (Levels): Given a path, level 0 consists of all nodes in the path. Level $i$ is then defined recursively as the nodes
that are local minimum nodes among the nodes at level $i-1$, if the level $i-1$ nodes are arranged in the order in which they
occur in the path. Denote the set of nodes at level $i$ as $level(i)$.

Proposition A.5: The number of levels having more than 1 node is at most $\log n$.

Proof: The proof follows from the fact that for each level $i$, no two consecutive nodes can be local minima. Hence
$|level(i)| \le |level(i-1)|/2$. ■
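
As a concrete illustration of Definition A.4, the following sketch builds the levels of a path from its node ids listed left to right. The function name `path_levels` and the random ids are assumptions made for the example. An endpoint is treated as a local minimum when its single neighbour is larger, and the bound is checked with a ceiling, since a level with an odd number of nodes can keep one more than half of them.

```python
import math
import random


def path_levels(ids):
    """Return [level 0, level 1, ...] for a path whose node ids are given left to right."""
    levels = [list(ids)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Keep the nodes smaller than both neighbours in the previous level's ordering.
        levels.append([
            x for i, x in enumerate(prev)
            if (i == 0 or prev[i - 1] > x) and (i == len(prev) - 1 or prev[i + 1] > x)
        ])
    return levels


random.seed(0)
ids = random.sample(range(1000), 100)    # hypothetical path with 100 distinct ids
levels = path_levels(ids)
big = sum(1 for lv in levels if len(lv) > 1)
print(big, math.ceil(math.log2(len(ids))))
```

The last level always consists of the global minimum alone, and the number of levels with more than one node stays within the logarithmic bound up to rounding.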

Lemma A.6: Consider a path $P$ with three segments $P_1$, $P_2$ and $P_3$, where $P_1$ and $P_3$ are arbitrary, and $P_2$ has $r + 1$ level
$\ell$ nodes $l_1, l_2, \ldots, l_r, m_1$ going from left to right. Assume that labels are such that $l_1 < l_2 < \cdots < l_r$. For a node $l_i$, denote by
$\mathbf{l}(l_i, k, \ell)$ the closest level $\ell$ node $l_j$ from $l_i$ towards the right such that $\min_k(l_j) > l_i$. Denote $T(k,\ell)$ and $M(k,\ell)$ as

and 

$M(k,\ell) = \min_{P:\ \min_k(l_1) > \min_{k-1}(m_1)\ \text{or}\ \min_k(m_1) > \min_{k-1}(l_1)} \Delta(l_1, m_1)$

Then the following is true: 

1) $T$ satisfies the following recurrence relation:

$T(k,\ell) \ge \min\{T(k-2,\ell) + \min\{T(k-1,\ell),\ M(k,\ell)\},\ M(k-1,\ell-1)\}$

2) $M$ satisfies the following recurrence relation:

$M(k,\ell) \ge \min\{T(k-2,\ell),\ M(k-1,\ell-1)\}$

Proof: Denote by $[l_t, l_u]$ the level $\ell$ nodes between $l_t$ and $l_u$, and by $[l_t, l_u)$ those nodes except for $l_u$.
Proof of claim 1: Let $l_s = \mathbf{l}(l_i, k-1, \ell)$ and let $l_t$ be the level $\ell$ node just to the left of $l_s$. We know by definition of
$l_s = \mathbf{l}(l_i, k-1, \ell)$ that $\min_{k-1}(l_s) > l_i$, but for all $l \in [l_i, l_t]$, $\min_{k-1}(l) < l_i$. Now there are three cases:

1) $\min_{k-1}(l_t) \in P_3$. Then $\min_{k-1}(l_t) < l_i$ (from above), and any node $l \in [l_t, l_r]$ would have $\min_{k-1}(l) \in P_3$ with
$\min_{k-1}(l) \le \min_{k-1}(l_t)$. Hence $\min_{k-1}(l) < l_i$ for all $l \in \{l_t, \ldots, l_r\}$. Thus $\mathbf{l}(l_i, k-1) = l_r$.

2) $\min_{k-1}(l_t) \notin P_3$ and $\Delta(l_i, l_t) \ge T(k-2, \ell)$. Denote $l_u = \mathbf{l}(l_t, k-1, \ell)$. Consider any $l \in [l_t, l_u)$. Let $a = \min_{k-1}(l)$.
Then from the definition of $l_u = \mathbf{l}(l_t, k-1)$ and since $l \in [l_t, l_u)$, we know that $a = \min_{k-1}(l) < l_t$. Now there are
two sub-cases:
a) For all $l \in [l_t, l_u)$, $a = \min_{k-1}(l) \notin P_3 \cup \{m_1\}$. In this case, we show that $\mathbf{l}(l_i, k, \ell)$ is a level $\ell$ node in $[l_u, l_r]$.
For this we will show that for all $l \in [l_t, l_u)$ and $a = \min_{k-1}(l)$, $\min_{k-1}(a) < l_i$. If $a \notin P_3 \cup \{m_1\}$, then either (i)
$a \in P_1$, in which case $a < l_i$ (otherwise $\min_{k-1}(l)$ would be $l_i$ and not $a$), and hence obviously $\min_{k-1}(a) < l_i$,
or (ii) $a \in P_2$ but $a \ne m_1$, and then since $a < l_t$ (from above), $a \in [l_1, l_t]$. Hence $\min_{k-1}(a) < l_i$.
In other words, after $k-1$ iterations, the minimum for any node $l \in [l_t, l_u)$ is $a$, which in turn has a minimum
$b < l_i$. Thus in one Hash-to-Min step, $b$ would become $\min_k(l)$. Hence after $k$ iterations and for any local minimum
$l$ between $l_i$ and $l_u$, we have $\min_k(l) < l_i$. This shows that $\mathbf{l}(l_i, k, \ell) \in [l_u, l_r]$. Then either $l_u$ is $l_r$ or the following
holds.

$\Delta(l_i, \mathbf{l}(l_i, k, \ell)) \ge \Delta(l_i, l_t) + \Delta(l_t, l_u) \ge T(k-2, \ell) + T(k-1, \ell)$

b) There exists $l \in [l_t, l_u)$ such that $a = \min_{k-1}(l) \in P_3$. Let $l_w = \mathbf{l}(l_i, k, \ell)$. Then $l_w$ is the first node in $[l_i, l_r]$ for
which $\min_k(l_w) > l_i$. If $\Delta(l_t, l_w) < M(k,\ell)$ then by definition after $k$ iterations $\min_k(l_w) < \min_{k-1}(l_t) < l_i$.
Thus $\Delta(l_t, l_w) \ge M(k,\ell)$. Thus

$\Delta(l_i, \mathbf{l}(l_i, k, \ell)) \ge \Delta(l_i, l_t) + \Delta(l_t, l_w) \ge T(k-2, \ell) + M(k, \ell)$

3) $\min_{k-1}(l_t) \in P_1 \cup P_2$ and $\Delta(l_i, l_t) < T(k-2, \ell)$. In this case, we argue that $\Delta(l_t, l_s) \ge M(k-1, \ell-1)$. Assume the
contrary: $\Delta(l_t, l_s) < M(k-1, \ell-1)$. Since $\Delta(l_i, l_t) < T(k-2, \ell)$, we know that $\min_{k-2}(l_t) < l_i$. Since $l_t$ and $l_s$ are
consecutive level $\ell$ nodes, all the level $\ell - 1$ nodes between them are ordered. Hence by definition of $M(k-1, \ell-1)$,
and the fact that $\Delta(l_t, l_s) < M(k-1, \ell-1)$, $\min_{k-1}(l_s) < \min_{k-2}(l_t) < l_i$. This contradicts the assumption that
$l_s = \mathbf{l}(l_i, k-1, \ell)$. Thus $\Delta(l_t, l_s) \ge M(k-1, \ell-1)$, and

$\Delta(l_i, \mathbf{l}(l_i, k)) \ge \Delta(l_i, l_t) + \Delta(l_t, l_s) \ge M(k-1, \ell-1)$

Combining the above cases we complete the proof of claim 1.

Proof of claim 2: If $\Delta(l_1, l_r) < T(k-2, \ell)$, then $\min_{k-2}(l_r) < l_1$. Also if $\Delta(l_r, m_1) < M(k-1, \ell-1)$, then $\min_{k-1}(m_1) <
\min_{k-2}(l_r) < l_1$. If both $\Delta(l_1, l_r) < T(k-2, \ell)$ and $\Delta(l_r, m_1) < M(k-1, \ell-1)$, then $\min_{k-1}(m_1) < l_1$. Hence by claim
1, $\min_k(m_1) < \min_{k-1}(l_1)$. This completes the proof.

■ 

Lemma A.7: Let $T(k,\ell)$ be the quantity as defined in Lemma A.6. Then $T(k,\ell) \ge 2^{k/2-\ell}$.
Proof: We prove the lemma using induction.

Base Cases: (i) $\ell = 0$ and $k \ge 1$. Then by Lemma A.2, $T(k, 0) \ge 2^k \ge 2^{k/2-0}$. (ii) $k = 1$ and $\ell \ge 1$. For any level $\ell \ge 1$,
$T(1,\ell) \ge 1 \ge 2^{1/2-\ell}$.

Induction Hypothesis (IH): For all $k_0 \le k-1$ and $\ell_0 \le \ell-1$, $T(k_0, \ell_0) \ge 2^{k_0/2-\ell_0}$.

Induction Step: By Lemma A.6 we know that:

$T(k,\ell) \ge \min(T(k-1,\ell) + T(k-2,\ell),\ T(k-2,\ell-1))$

$\ge \min(2^{(k-1)/2-\ell} + 2^{(k-2)/2-\ell},\ 2^{(k-2)/2-\ell+1})$ (using IH)

$\ge \min(2^{k/2-\ell}(1/2 + 1/\sqrt{2}),\ 2^{k/2-\ell}) \ge 2^{k/2-\ell}$
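
The arithmetic behind the last two inequalities of the induction step can be spelled out as follows. This is only a worked restatement of that step, with no additional claim:

```latex
\begin{align*}
2^{(k-1)/2-\ell} + 2^{(k-2)/2-\ell}
  &= 2^{k/2-\ell}\left(2^{-1/2} + 2^{-1}\right)
   = 2^{k/2-\ell}\left(\tfrac{1}{\sqrt{2}} + \tfrac{1}{2}\right), \\
2^{(k-2)/2-(\ell-1)} &= 2^{(k-2)/2-\ell+1} = 2^{k/2-\ell}, \\
\tfrac{1}{\sqrt{2}} + \tfrac{1}{2} &\approx 1.207 > 1 .
\end{align*}
```

Since the first term of the minimum exceeds $2^{k/2-\ell}$ and the second equals it, the minimum is exactly $2^{k/2-\ell}$.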



Finally, we can complete the proof of Theorem 3.6. Since $T(k,\ell)$ is less than the length of the path, we know that $T(k,\ell) \le n$.
Now from Prop. A.5 the number of levels having more than 1 node is at most $\log n$. Hence $\ell \le \log n$. Finally, from Lemma A.7
we know that $T(k,\ell) \ge 2^{k/2-\ell}$. Thus $2^{k/2-\log n} \le 2^{k/2-\ell} \le n$. Thus $k \le 4 \log n$. This completes the proof of Theorem 3.6.
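
The bound can be compared against a direct simulation. The sketch below is a minimal, single-machine illustration and not a map-reduce implementation: it assumes the Hash-to-Min update rule (each node sends its cluster to the cluster's minimum and the minimum to every member), builds a hypothetical path whose ids are a random permutation, and counts the rounds that change some cluster. Reading $\log$ as base 2 is an assumption of this example.

```python
import math
import random


def hash_to_min(adj):
    """Run Hash-to-Min to a fixpoint; return (number of state-changing rounds, final clusters)."""
    clusters = {v: {v} | set(nbrs) for v, nbrs in adj.items()}
    rounds = 0
    while True:
        received = {v: set() for v in adj}
        for cluster in clusters.values():
            m = min(cluster)
            received[m] |= cluster       # send C_v to its minimum
            for u in cluster:
                received[u].add(m)       # send the minimum to every member
        if received == clusters:         # nothing changed: converged
            return rounds, clusters
        clusters = received
        rounds += 1


def path_graph(ids):
    """Adjacency lists of the path visiting `ids` from left to right."""
    adj = {v: [] for v in ids}
    for a, b in zip(ids, ids[1:]):
        adj[a].append(b)
        adj[b].append(a)
    return adj


random.seed(1)
n = 128                                   # hypothetical path length
ids = random.sample(range(n), n)          # arbitrary ordering of ids along the path
rounds, clusters = hash_to_min(path_graph(ids))
bound = 4 * math.log2(n)
print(rounds, bound)
```

At convergence the minimum node holds the whole component and every other node holds only the minimum, and the observed round count can be read against $4 \log n$.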



B. Proof of Theorem 3.9

We first restate Theorem 3.9 below.

Theorem A.8 (3.9): Algorithm Hash-Greater-to-Min correctly computes the connected components of $G = (V, E)$ in
expected $3 \log n$ map-reduce rounds (expectation is over the random choices of the node ordering) with $2(|V| + |E|)$
communication per round in the worst case.

Proof: After $3k$ rounds, denote by $M_k = \{\min(C_v) : v \in V\}$ the set of nodes that appear as minimum on some node. For
a minimum node $m \in M_k$, denote by $GT_k(m)$ the set of all nodes $v$ for which $m = \min(C_v)$. Then by Lemma 3.8 we know
that $GT_k(m) = C_{\ge}(m)$ after $3k$ rounds. Obviously $\bigcup_{m \in M_k} GT_k(m) = V$ and for any $m, m' \in M_k$, $GT_k(m) \cap GT_k(m') = \emptyset$.

Consider the graph $G_{M_k}$ with nodes as $M_k$ and an edge between $m \in M_k$ and $m' \in M_k$ if there exist $v \in GT_k(m)$ and
$v' \in GT_k(m')$ such that $v, v'$ are neighbors in the input graph $G$. If a node $m$ has no outgoing edges in $G_{M_k}$, then $GT_k(m)$
forms a connected component in $G$ disconnected from other components. This is because, in that case, for all $v' \notin GT_k(m)$, there
exists no edge to any $v \in GT_k(m)$.
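
The construction of $G_{M_k}$ can be sketched in a few lines. The function name `minima_graph`, the edge list and the `min_of` assignment below are hypothetical and chosen only to illustrate the definition: each node is mapped to its current minimum, and two minima are joined whenever an input edge crosses between their sets.

```python
def minima_graph(edges, min_of):
    """Build G_{M_k}: join m and m' when some input edge (v, v') has
    min_of[v] == m, min_of[v'] == m' and m != m'."""
    g = {m: set() for m in set(min_of.values())}
    for v, w in edges:
        a, b = min_of[v], min_of[w]
        if a != b:
            g[a].add(b)
            g[b].add(a)
    return g


# Hypothetical input: two components, {1, ..., 5} and {6, 7}.
edges = [(1, 4), (4, 2), (2, 5), (5, 3), (6, 7)]
min_of = {1: 1, 4: 1, 2: 2, 5: 2, 3: 3, 6: 6, 7: 6}

g = minima_graph(edges, min_of)
finished = sorted(m for m, nbrs in g.items() if not nbrs)  # sets that are already whole components
active = sorted(m for m, nbrs in g.items() if nbrs)        # minima with at least one outgoing edge
print(finished, active)
```

In this example the minimum 6 has no outgoing edge, so its set $\{6, 7\}$ is already a full component, while 1, 2 and 3 remain connected to one another and still have to merge.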
We can safely ignore such sets $GT_k(m)$. Let $MC_k$ be the set of nodes in $G_{M_k}$ that have at least one outgoing edge. Also if
$m \in MC_k$ has an edge to $m' < m$ in $G_{M_k}$, then $m$ will no longer be the minimum of nodes $v \in GT_k(m)$ after 3 additional
rounds. This is because there exist nodes $v \in GT_k(m)$ and $v' \in GT_k(m')$, such that $v$ and $v'$ are neighbors in $G$. Hence in
the first round of Hash-Min, $v'$ will send $m'$ to $v$. In the second round of Hash-Min, $v$ will send $m'$ to $m$. Hence finally $m$
will get $m'$, and in the round of Hash-Greater-to-Min, $m$ will send $GT_k(m)$ to $m'$.

If $|MC_k| = l$, w.l.o.g. we can assume that they are labeled $1, 2, \ldots, l$ (since only the relative ordering between them matters
anyway). For any set $GT_k(m)$, the probability that its min $m' \in (l/2, l]$ after 3 more rounds is at most 1/4. This is because that
happens only when $m \in (l/2, l]$ and all its neighbors $m'$ in $G_{M_k}$ are also in $(l/2, l]$. Since there exists at least one neighbor
$m'$, the probability of $m' \in (l/2, l]$ is at most 1/2. Hence the probability of any node $v$ having a min $m' \in (l/2, l]$ after 3
more rounds is at most 1/4.

Now since no set $GT_k(m)$ ever gets split in subsequent rounds, the expected number of cores is $3l/4$ after 3 more rounds.
Hence in three rounds of Hash-Greater-to-Min, the expected number of cores reduces from $l$ to $3l/4$, and therefore it will
terminate in expected $3 \log n$ time.

The communication complexity is $2(|V| + |E|)$ per round in the worst case since the total size of clusters is $\sum_v C_{\ge}(v) = 2(|V|)$.



C. Proof of Theorem 5.6

We first restate Theorem 5.6 below.

Theorem A.9 (5.6): The distributed Algorithm 3 simulates the centralized Algorithm 2, i.e., it outputs the same clustering
as Algorithm 2.

Proof: Let $\mathcal{C}_{central}$ be the clustering output by Algorithm 2. Let $\mathcal{C}_{distributed}$ be the clustering output by Algorithm 3.
We show the result in two parts.

First, for any cluster $C_{distributed} \in \mathcal{C}_{distributed}$, there exists a cluster $C_{central} \in \mathcal{C}_{central}$ such that $C_{distributed} \subseteq C_{central}$.
Since Algorithm 3 uses the splitting algorithm, it outputs only cores having cluster splits $C_l$, $C_r$ for which $Stop_{local}(C_l)$ and
$Stop_{local}(C_r)$ equal false. Thus we can invoke Lemma 5.3 on $C_{distributed}$ to prove that $C_{distributed}$ is valid, and the existence
of $C_{central}$ such that $C_{distributed} \subseteq C_{central}$.

Having shown that $C_{distributed} \subseteq C_{central}$, we now show that, in fact, $C_{distributed} = C_{central}$. Assume the contrary, i.e.
$C_{distributed} \subset C_{central}$. Since $C_{distributed}$ is valid, even the centralized algorithm constructed $C_{distributed}$ some time during
its execution, and then merged it with some other cluster, say $C'_{central}$.

Since $C_{distributed}$ is in the output of Algorithm 3, three cases are possible: (i) $C_{distributed}$ forms a connected component
by itself, disconnected from the rest of the graph, or (ii) $Stop_{local}(C_{distributed})$ is true, and the algorithm stops because
of the stopping condition, or (iii) $Stop_{local}(C_{distributed})$ is false, but it merges with some cluster $C'_{distributed}$ for which
$Stop_{local}(C'_{distributed})$ is true. In the first two cases, even the centralized algorithm cannot merge $C_{distributed}$ with any other
cluster, contradicting that $C_{distributed} \subset C_{central}$.
For case (iii), we show below that in fact $C'_{distributed} \subseteq C'_{central}$. Since the central algorithm merges $C'_{central}$ with $C_{central}$,
$Stop_{local}(C'_{central})$ has to be false. Since $Stop_{local}$ is monotonic, and $C'_{distributed} \subseteq C'_{central}$, $Stop_{local}(C'_{distributed})$ has to
be false as well, contradicting the assumption made in case (iii). Thus we proved all three cases are impossible, contradicting
our assumption of $C_{distributed} \subset C_{central}$. Hence $C_{distributed} = C_{central}$.

Now we show that $C'_{distributed} \subseteq C'_{central}$. Both $C'_{distributed}$ and $C'_{central}$ have to be closest to $C_{distributed}$, i.e. in
$nbrs(C_{distributed})$, in order to get merged with it in either the central or distributed algorithms. Denote by $v$ the node such
that the singleton cluster $\{v\}$ is in $nbrs(C_{distributed})$. Hence, by the property of single linkage clustering, both $C'_{distributed}$
and $C'_{central}$ must contain the node $v$. Since $C'_{distributed}$ is valid, there must be a cluster in the central algorithm's output
containing it. Finally, since clusters in the output have to be disjoint, the cluster containing $C'_{distributed}$ has to be $C'_{central}$,
and thus $C'_{distributed} \subseteq C'_{central}$. ■
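
The proof relies on $Stop_{local}$ being monotonic, in the sense that once it is true for a cluster it stays true for every superset. The sketch below checks that property by brute force for a purely hypothetical size-based criterion. The predicate `stop_local`, its threshold and the helper `is_monotonic` are assumptions for illustration and are not the stopping criterion used by Algorithms 2 and 3.

```python
from itertools import combinations


def stop_local(cluster, max_size=3):
    """Hypothetical criterion: stop once a cluster holds max_size or more nodes."""
    return len(cluster) >= max_size


def is_monotonic(stop, universe):
    """True if stop(small) implies stop(big) for every pair small <= big of subsets."""
    subsets = [frozenset(c)
               for r in range(len(universe) + 1)
               for c in combinations(universe, r)]
    return all(stop(big)
               for small in subsets if stop(small)
               for big in subsets if small <= big)


ok = is_monotonic(stop_local, range(5))               # size threshold: monotonic
bad = is_monotonic(lambda c: len(c) == 2, range(5))   # "exactly two nodes": not monotonic
print(ok, bad)
```

With a monotonic criterion, a false value on $C'_{central}$ forces a false value on any subset of it, which is exactly the step used in case (iii).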



